The Architecture of Errors: From Universal Impossibility to Patch-Local LLM Reliability
Quick Take
The paper argues that no finite intervention dictionary can guarantee universal LLM reliability, but that deployed systems operate within operationally bounded patches where failures are sparse, repetitive, and concentrated. It formalizes a patch-local framework in which, under log active-mode exposure and head-heavy coverage, a sufficient per-hard-decision intervention budget grows polylogarithmically with sequence length and becomes domain-constant once the patch catalogue saturates.
Key Points
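- Universal LLM reliability is not a finite-library problem: new failure modes can appear without bound, so no finite intervention dictionary covers them all.
- Deployed systems operate within operationally bounded patches, where empirical evidence suggests failures concentrate in a small recurring catalogue.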
- Reliability becomes a local catalogue-discovery problem rather than an exponential token-length issue.
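- Under log active-mode exposure and head-heavy coverage, a sufficient per-hard-decision intervention budget grows polylogarithmically with sequence length and becomes domain-constant once the patch catalogue saturates.
- The framework relocates rather than dissolves long-context difficulty: where the number of hard decisions grows with task length, reliability remains hard, and the contribution is to identify the on-axis intervention.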
Article Content
From source RSS / original summary
arXiv:2605.30628v1 Announce Type: new
Abstract: Universal LLM reliability is not a finite-library problem: across all possible tasks, tools, schemas, knowledge sources, and evaluator expectations, new intervention-distinguishable failure modes can appear without bound, so no finite intervention dictionary can guarantee bounded residual error for every such mode. But deployed systems do not operate over the whole universe.
They operate inside operationally bounded patches (legal review, medical RAG, code repair, customer-support agents, contract extraction) with recurring tasks, schemas, tools, and evaluator expectations. Within such patches, empirical evidence suggests failures are sparse, repetitive, and concentrated in a small recurring catalogue, so reliability becomes a local catalogue-discovery and intervention-coverage problem rather than an exponential token-length problem.
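As a rough illustration of what "concentrated in a small recurring catalogue" could mean in practice, the sketch below counts how many of the most frequent failure modes are needed to cover a target share of observed failures. The function name `modes_needed_for_coverage`, the mode labels, and the counts are hypothetical and are not taken from the paper.

```python
from collections import Counter


def modes_needed_for_coverage(failure_labels, target=0.9):
    """Return the most frequent failure modes that together account for
    at least `target` of all observed failures (hypothetical helper)."""
    if not 0 < target <= 1:
        raise ValueError("target must be in (0, 1]")
    counts = Counter(failure_labels)
    total = sum(counts.values())
    head, covered = [], 0
    # Walk modes from most to least frequent until the target share is covered.
    for mode, n in counts.most_common():
        if total == 0 or covered / total >= target:
            break
        head.append(mode)
        covered += n
    return head


# Made-up failure log for a single patch (illustrative data only).
observed = (
    ["schema_mismatch"] * 60
    + ["stale_citation"] * 25
    + ["tool_timeout"] * 10
    + ["unit_error"] * 3
    + ["other"] * 2
)
head = modes_needed_for_coverage(observed, target=0.9)
print(len(head), "of", len(set(observed)), "modes cover 90% of failures:", head)
```

In a head-heavy log like this one, interventions aimed at the first few modes address most failures, which is the sense in which the problem becomes one of catalogue discovery and coverage.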
We formalize this transition with two propositions and one corollary. Proposition 1 is the worst-case-mode-wise negative result: no finite intervention dictionary covers every distinguishable failure mode of an unbounded domain. Corollary 1 is the inverse-discovery implication: the logarithmic upper bound on mode discovery cannot accommodate linearly more distinct tail modes without exponentially more observed hard-failure events.
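The arithmetic behind the inverse-discovery implication can be sketched as follows, assuming the bound takes the simple form "at most c * ln(n) distinct modes after n hard-failure events". The abstract states only that the bound is logarithmic; the constant `c`, the natural-log form, and the function names are assumptions made for this illustration.

```python
import math


def min_events_for_modes(num_modes, c=1.0):
    """Smallest n satisfying c * ln(n) >= num_modes, i.e. n = exp(num_modes / c).
    The form c * ln(n) is an assumed stand-in for the paper's bound."""
    if c <= 0:
        raise ValueError("c must be positive")
    return math.exp(num_modes / c)


def event_multiplier(extra_modes, c=1.0):
    """Factor by which observed events must grow to admit `extra_modes`
    additional distinct modes under the same assumed bound."""
    if c <= 0:
        raise ValueError("c must be positive")
    return math.exp(extra_modes / c)


# Each additional block of modes multiplies the required events by a fixed factor,
# so linear growth in modes means exponential growth in events.
for extra in (1, 5, 10):
    print(extra, "extra modes -> events x", round(event_multiplier(extra, c=2.0), 1))
```

Under this assumed form, accommodating k more distinct modes multiplies the required number of hard-failure events by exp(k / c), regardless of how many events have already been observed.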
Proposition 2 is the positive patch-local result: under log active-mode exposure and head-heavy coverage, a sufficient per-hard-decision intervention budget grows polylogarithmically in sequence length and becomes domain-constant once the patch catalogue saturates.
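To give a feel for polylogarithmic growth that levels off, here is a minimal sketch that models the budget as a * (ln T)^p capped at a constant. This functional form, the parameters `a`, `p`, and `saturation_cap`, and the function name are all assumptions chosen for illustration; the abstract does not give the actual expression.

```python
import math


def sketch_budget(seq_len, a=1.0, p=2.0, saturation_cap=None):
    """Illustrative budget a * (ln seq_len) ** p, optionally capped to mimic
    a catalogue that has saturated. Not the paper's formula."""
    if seq_len < 1:
        raise ValueError("seq_len must be at least 1")
    budget = a * math.log(seq_len) ** p
    if saturation_cap is not None:
        budget = min(budget, saturation_cap)
    return budget


# A 1000x increase in sequence length only multiplies the uncapped budget by 4 when p = 2.
for length in (10**3, 10**6, 10**9):
    print(length, round(sketch_budget(length), 1), sketch_budget(length, saturation_cap=50.0))
```

The contrast with linear or exponential growth in sequence length is the point of the sketch; the specific numbers carry no meaning beyond that.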
The framework relocates rather than dissolves long-context difficulty: where the number of hard decisions itself grows with task length, reliability remains hard; the contribution is to identify the on-axis intervention rather than to make those regimes easy.