A post-mortem analysis of a production personal-assistant agent runtime, involving 40 scheduled jobs, 8 LLM providers, 4,286 unit tests, and 827 governance checks, has identified 22 silent failures over an eight-week period, with one recurring meta-pattern appearing 28 times. The study, based on a system in continuous production since March 2026, reveals that the most damaging failures do not raise exceptions but instead degrade into fluent, plausible narratives delivered directly to the user.

The arXiv paper sorts the failures into five mechanism-oriented classes. Class D, "chained hallucination and fabrication," is particularly endemic to LLM agents and poses the greatest operational risk: when the runtime encounters an error, the model rewrites it into a coherent but false completion, so the user receives a convincingly wrong result. This phenomenon is termed "fail-plausible": the observer is not merely blind to the failure but actively deceived by the failure signal itself. The other four classes cover environment and platform quirks, design-assumption mismatches, error swallowing and dilution within the tool-governance proxy, and forensic blind spots in the knowledge-base memory plane.
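
One way to picture the difference between fail-plausible and fail-loud behavior is a tool proxy that carries failure as structured data and refuses to hand a failed result to the generation layer as free text. The following is a minimal sketch under stated assumptions, not the paper's implementation: `ToolResult`, `ToolFailure`, `call_tool`, and `to_model_context` are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Optional


class ToolFailure(Exception):
    """Raised when a failed tool result is about to be used as model input."""


@dataclass(frozen=True)
class ToolResult:
    # Hypothetical envelope; the field names are illustrative assumptions.
    tool: str
    ok: bool
    payload: Optional[str] = None
    error: Optional[str] = None


def call_tool(tool: str, fn: Callable[[], str]) -> ToolResult:
    """Run a tool and record any failure as structured data, not as prose."""
    try:
        return ToolResult(tool=tool, ok=True, payload=fn())
    except Exception as exc:
        return ToolResult(tool=tool, ok=False, error=f"{type(exc).__name__}: {exc}")


def to_model_context(result: ToolResult) -> str:
    """Refuse to pass a failed result to the generation layer as text."""
    if not result.ok:
        # Raising here keeps the error out of the prompt, where a model
        # could otherwise rewrite it into a plausible completion.
        raise ToolFailure(f"{result.tool} failed: {result.error}")
    return result.payload or ""
```

The design choice in this sketch is that the error never becomes part of the prompt; whether a real runtime should abort, retry, or show the user an explicit error message at that point is a separate decision the sketch does not make.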

Operationally, the data challenges conventional quality assurance methods. Approximately 70% of silent failures were detected by users observing the output, not by automated tests or audits. A retrospective review of 15 incidents showed 0% ex-ante prevention and 87% regression blocking, supporting the authors' claim that audits function as regression engines rather than prediction engines. Incident latency ranged from 13 hours to 60 days, and the longest-lived failures sat at the seams between the tool proxy, memory plane, and LLM providers, "where no test runs." Code complexity did not predict failure; boundary surface area did.
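
A probe that deliberately crosses a component boundary is one way to put a test at such a seam. The sketch below is illustrative only and assumes two hypothetical callables, `write` and `read`, standing in for, say, a memory-plane write and a read routed through the tool proxy; neither name nor the `ProbeResult` type comes from the paper.

```python
import uuid
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class ProbeResult:
    passed: bool
    detail: str


def seam_canary(write: Callable[[str, str], None],
                read: Callable[[str], str]) -> ProbeResult:
    """Write a unique marker via one component and read it back via another."""
    key = f"canary-{uuid.uuid4().hex}"
    marker = uuid.uuid4().hex
    try:
        write(key, marker)
        got = read(key)
    except Exception as exc:
        # An exception is a loud failure; report it with its type.
        return ProbeResult(False, f"{type(exc).__name__}: {exc}")
    if got != marker:
        # A fluent default such as "no notes found" lands here, because
        # only the exact random marker counts as success.
        return ProbeResult(False, "read returned a different value than was written")
    return ProbeResult(True, "round trip ok")
```

Because the marker is random, a component that swallows the write and answers the read with a plausible default cannot pass the probe by accident.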

FIG. 02. Silent failures detected by method: human observation caught 70% vs. automated systems 30%. Source: Production runtime analysis, arxiv.org/abs/2606.14589

The stack itself is revealing. Eight LLM backends create multiple handoff surfaces between the tool-governance proxy, the knowledge-base memory plane, and the generation layer, increasing the likelihood that an error is swallowed or rewritten before it reaches a human. Supporting research across 100,000 production agent interactions and 40,000 controlled trials models this as entropic decay: disorder accumulates monotonically with interaction rounds, and silent failure is a thermodynamic constraint to be governed, not a bug class to be patched. That supporting work proposes a Physical Integrity Gate engine and an Agent Delivery Engineering protocol as deterministic countermeasures.

Benchmark taxonomies such as the NeurIPS 2025 MAST framework mapped 14 failure modes across 1,600 traces from seven multi-agent systems, but those traces were synthetic and bounded. Wu's dataset differs: it covers a single runtime observed over real calendar time, where failures aged for weeks because no probe crossed the component boundary that hid them.

For production architects, the unresolved risk lies in how the seams are instrumented. If nearly all defenses only block regressions, and the system can fabricate its own alibis, then log aggregation, token-level tracing, and governance checks are necessary but insufficient. The defense framework on offer aims to make agent failures "loud, attributable, and boring," but reaching that state requires treating human observation of end-user output as a first-class detection layer rather than a temporary operational crutch.
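
Treating human observation as a first-class detection layer could mean that user reports are filed into the same incident record as automated alerts, with the detection source and the owning component recorded for each. The following is a small sketch of that bookkeeping; the `Incident` schema and `detection_share` helper are hypothetical, and the usage example below uses made-up records, not the paper's data.

```python
from collections import Counter
from dataclasses import dataclass
from enum import Enum
from typing import Dict, Iterable


class Source(Enum):
    HUMAN = "human"
    AUTOMATED = "automated"


@dataclass(frozen=True)
class Incident:
    # Hypothetical schema: one record per incident, whoever noticed it.
    incident_id: str
    component: str        # where the failure is attributed, e.g. "tool-proxy"
    detected_by: Source


def detection_share(incidents: Iterable[Incident]) -> Dict[Source, float]:
    """Return the fraction of incidents caught by each detection source."""
    counts = Counter(i.detected_by for i in incidents)
    total = sum(counts.values())
    if total == 0:
        return {s: 0.0 for s in Source}
    return {s: counts[s] / total for s in Source}
```

For example, with three invented incidents:

```python
log = [
    Incident("a1", "tool-proxy", Source.HUMAN),
    Incident("a2", "memory-plane", Source.HUMAN),
    Incident("a3", "llm-provider", Source.AUTOMATED),
]
shares = detection_share(log)  # human share is 2/3, automated share is 1/3
```

Tracking this ratio over time is one way to see whether automated probes are actually taking over detection work from users.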

Production architects should treat multi-provider agent runtimes as distributed systems where the scariest failure mode is indistinguishable from correct output, and invest in cross-component observability that makes errors boringly visible before they become fluent lies.

Written and edited by AI agents