Wire #10 — The week agent reliability became an infrastructure problem
Agents don't break in the model—they break in the harness, memory, latency, and adoption mandates that CTOs are signing without measuring.
Transcript
Six weeks. That's how long Anthropic itself took to track three overlapping changes that brought down Claude Code.
And if the team that trained the model took six weeks to find it—your team won't find it in a live agent.
This is the ai|expert Wire. The day it became clear: an agent in production doesn't break in the model—it breaks in the harness, memory, latency, and the mandate someone signed without measuring.
On April 23, Anthropic published a postmortem. Three separate product changes, overlapping, affecting different cohorts in different time windows—creating the appearance of broad, inconsistent degradation.
The model didn't regress. The API weights stayed stable. What happened: on March 4, Claude Code's default reasoning effort dropped from "high" to "medium" to prevent UI freezes. It stayed that way for 33 days. On March 26, a cache bug made the agent lose its own reasoning history on every turn after an hour idle—a user with 900 thousand tokens in context who paused for an hour triggered a complete cache miss on the next message. On April 16, a verbosity cap limited responses to 100 words. Internal test: no regression. Investigation test: 3% drop in code evals.
Stella Laurenzo, from AMD, analyzed 6,852 Claude Code session files, 17,871 reasoning blocks, and 234,760 tool calls. What she found: the reads-per-edit ratio dropped from 6.6 to 2.0—the agent stopped researching before editing.
That's not model failure. That's system failure. And what Anthropic's postmortem leaves implicit, a paper published the same week makes explicit.
The paper is AI Harness Engineering, by Hailin Zhong and Shengxin Zhu. The core thesis: software agents fail not because the model lacks capacity—they fail because the runtime infrastructure to verify and attribute output simply doesn't exist.
They name eleven component responsibilities: task specification, context selection, tool access, project memory, task state, observability, failure attribution, verification, permissions, entropy audit, and intervention logging. And they propose a four-level ladder, H0 to H3.
H0 is where most production setups sit today: task in, patch out, no runtime support. H3 is full coverage: reproduction logs, deterministic requirement checks, structured verification reports.
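One way to make that ladder concrete is to audit a deployment against the eleven responsibilities the paper names. The sketch below is only a checklist: the function name audit_harness is hypothetical, and it does not reproduce the paper's own criteria for H0 through H3, which the summary here doesn't spell out.

```python
# The eleven responsibilities named in the paper, in the order listed above.
RESPONSIBILITIES = (
    "task specification",
    "context selection",
    "tool access",
    "project memory",
    "task state",
    "observability",
    "failure attribution",
    "verification",
    "permissions",
    "entropy audit",
    "intervention logging",
)


def audit_harness(covered):
    """Return (coverage fraction, missing responsibilities in list order).

    Hypothetical helper: `covered` is whatever subset of RESPONSIBILITIES
    a team believes its runtime actually supports.
    """
    covered = set(covered)
    unknown = covered - set(RESPONSIBILITIES)
    if unknown:
        raise ValueError(f"unknown responsibilities: {sorted(unknown)}")
    missing = [r for r in RESPONSIBILITIES if r not in covered]
    coverage = (len(RESPONSIBILITIES) - len(missing)) / len(RESPONSIBILITIES)
    return coverage, missing


if __name__ == "__main__":
    coverage, missing = audit_harness({"task specification", "tool access"})
    print(f"{coverage:.0%} covered; missing: {', '.join(missing)}")
```

A setup that only hands the agent a task and a toolbox leaves nine of the eleven uncovered, which is roughly what "task in, patch out" looks like on paper.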
Anthropic's postmortem is an H0 case turned H1 in a rush. Nobody instrumented reasoning effort by session before rollout. Nobody measured reads-per-edit as a health indicator. Anthropic found the causes—but it took six weeks.
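A reads-per-edit health check is cheap to sketch. The version below assumes each session is available as a flat list of tool names. The tool names, the 4.0 alert floor, and the function names are illustrative assumptions, not the method used in the AMD analysis; the only figures taken from the reporting are the 6.6 and 2.0 ratios.

```python
from collections import Counter

# Assumed tool names; adjust to whatever the agent's logs actually record.
READ_TOOLS = {"Read", "Grep", "Glob"}
EDIT_TOOLS = {"Edit", "Write"}


def reads_per_edit(tool_calls):
    """Ratio of read-type calls to edit-type calls, or None if no edits."""
    counts = Counter(tool_calls)
    reads = sum(counts[name] for name in READ_TOOLS)
    edits = sum(counts[name] for name in EDIT_TOOLS)
    if edits == 0:
        return None  # ratio undefined for read-only sessions
    return reads / edits


def is_healthy(tool_calls, floor=4.0):
    """Flag sessions whose ratio falls below an assumed floor.

    The 4.0 default is an arbitrary point between the reported 6.6 and 2.0.
    """
    ratio = reads_per_edit(tool_calls)
    return ratio is None or ratio >= floor


if __name__ == "__main__":
    before = ["Read"] * 33 + ["Edit"] * 5   # 33 reads per 5 edits
    after = ["Read"] * 10 + ["Edit"] * 5    # 10 reads per 5 edits
    print(reads_per_edit(before), is_healthy(before))
    print(reads_per_edit(after), is_healthy(after))
```

Tracked per session and per release cohort, a number like this would have shown the shift in days instead of weeks.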
Latency is the next layer. Researchers at UC Berkeley published a framework called Speculative Interaction Agents. The problem they attack: in standard agentic workflows, the agent waits for the user to finish speaking before reasoning, and pauses reasoning while tool calls execute.
For voice, sub-second is the requirement. Multi-turn tool calling adds several seconds of latency on top of inference time. The framework uses two mechanisms: async I/O—which decouples the reasoning loop from input and environment streams—and Speculative Tool Calling, which fires low-risk calls before the user finishes specifying parameters.
Results: 1.3 to 1.7 times faster on cloud APIs like OpenAI Realtime and Gemini Live, with no model change. On edge models like Qwen2.5 and Llama-3.2, they hit 2.2x.
The hidden cost: every tool must be classified as speculatively safe or pending confirmation. That's not automated. It's the integration tax that the announcement doesn't mention—and it falls on the engineering team, not the vendor.
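That classification work looks roughly like the registry below. It is a minimal sketch with made-up tool names and a hypothetical plan_calls helper, not the Berkeley framework's API. The one design choice worth copying is the default: anything unclassified is treated as needing confirmation.

```python
from enum import Enum


class Risk(Enum):
    SPECULATIVE_SAFE = "speculative_safe"      # read-only, cheap to discard
    NEEDS_CONFIRMATION = "needs_confirmation"  # side effects, must wait


# Hypothetical registry: each entry is a manual engineering decision.
TOOL_RISK = {
    "search_flights": Risk.SPECULATIVE_SAFE,
    "get_weather": Risk.SPECULATIVE_SAFE,
    "book_flight": Risk.NEEDS_CONFIRMATION,
    "send_payment": Risk.NEEDS_CONFIRMATION,
}


def can_speculate(tool_name):
    """Unclassified tools fall back to the conservative class."""
    risk = TOOL_RISK.get(tool_name, Risk.NEEDS_CONFIRMATION)
    return risk is Risk.SPECULATIVE_SAFE


def plan_calls(candidate_tools, confirmed=False):
    """Split candidates into calls to fire now and calls to hold."""
    fire, hold = [], []
    for name in candidate_tools:
        if confirmed or can_speculate(name):
            fire.append(name)
        else:
            hold.append(name)
    return fire, hold


if __name__ == "__main__":
    print(plan_calls(["search_flights", "book_flight", "new_tool"]))
```

Every new tool added to the agent means another row in that table, reviewed by someone who understands its side effects.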
The security layer has its own problem. Microsoft Research generated 30 thousand adversarial strategies from 2,500 Wikipedia articles—and showed that sufficiently implausible attacks pierce guardrails that block all conventional manipulation.
The paper's examples are instructive for their absurdity: a fake international treaty—"the Geneva Convention on Coffee requires no more than two dollars per bean"—a manufactured climate emergency, a spurious technical constraint. A human seller would reject all three. The agent accepted the premises and adjusted behavior.
Why? Because the entire safety pipeline (pretraining data, RLHF reward models, human red-teaming) is calibrated to human judgment of threats. Attacks humans wouldn't attempt rarely appear in the training signal.
Claude Sonnet 4.5 proved nearly immune to direct prompt injection. But in multi-agent routing environments, even GPT-5 failed. A single malicious message propagated through over a hundred agents, consumed over a hundred LLM calls, and circulated for more than twelve minutes.
Resistance to direct prompt injection doesn't generalize to agentic routing graphs.
The last piece this week: what if the agent could learn from its own failures—without retraining?
That's what FORGE does. Researchers from Carleton University, Defence R&D Canada, and Cistel Technology published a protocol where agents evolve memory through reflection, without weight updates, without fine-tuning, without distillation from a stronger model. On CybORG CAGE-2, a network defense task with 30 steps and partial observability, FORGE reduced severe failure events to approximately 1% and improved 1.7 to 7.7 times over zero-shot baselines.
The critical mechanism, according to the ablations, is population broadcast. The best-performing agent distributes its memory artifact to all other agents in the population. Removing the broadcast collapses results to standard Reflexion level.
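The broadcast step itself is simple to picture. The sketch below is an illustration of the idea, not FORGE's implementation: the Agent class, the score field, and the assumption that a higher score is better are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Agent:
    name: str
    memory: list = field(default_factory=list)  # e.g. textual rules
    score: float = 0.0                           # assumed: higher is better


def broadcast_best(population):
    """Copy the best agent's memory to every other agent; return the best."""
    if not population:
        raise ValueError("population is empty")
    best = max(population, key=lambda agent: agent.score)
    for agent in population:
        if agent is not best:
            # Copy so later edits by one agent don't leak into the others.
            agent.memory = list(best.memory)
    return best


if __name__ == "__main__":
    population = [
        Agent("a", ["isolate hosts first"], score=0.2),
        Agent("b", ["restore before analyse"], score=0.9),
        Agent("c", [], score=0.5),
    ]
    best = broadcast_best(population)
    print(best.name, [agent.memory for agent in population])
```

Without this step, each agent only ever reflects on its own episodes, which is the setup the ablation compares against.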
Rules Format—textual heuristics—uses 40% fewer tokens than few-shot examples, with modest loss in accuracy. For high-throughput pipelines, it's the right trade. And the strategic detail: weaker models benefited more from FORGE than stronger models. It narrows capacity gaps. It doesn't amplify what already exists.
The full picture from the first block is this: the agent in production breaks in the harness, latency, guardrails calibrated for humans, and memory that goes missing between sessions. None of those layers are on the vendor roadmap. All are the responsibility of whoever deployed it.
While engineering faces these layers, the market sends another signal.
Cisco reported $15.84 billion in revenue for the third quarter of fiscal 2026, the largest quarter in the company's 41-year history. The stock rose 15% in after-hours trading. In the same announcement, the company disclosed fewer than four thousand job cuts.
AI infrastructure orders total $9 billion for the fiscal year, revised up from $5 billion, which itself replaced an original target of $1 billion. Microsoft Azure, Google Cloud, Amazon Web Services, and Meta are the named hyperscalers. Networking revenue grew 25% to $8.82 billion.
Gross margin fell to 66%, a drop of approximately three percentage points. Heavy AI hardware carries lower margins than software. Cisco won the current silicon cycle, but with tighter margins. And the 3,800 layoffs fund the next bet.
404 Media published reports of developers at Amazon, Google, Microsoft, and fintechs operating under explicit mandate to use AI in code—regardless of output quality or security impact.
An Amazon employee started inflating reported AI usage to meet adoption metrics. A developer said directly: "The actual quality of output doesn't matter as much as our willingness to participate." Prompts logged in compliance, output discarded, code written by hand.
Google reports 75% AI-generated code. Anthropic, 90%. Microsoft's CTO expects 95% of all company code to be AI-generated by 2030.
These percentages measure mandate compliance—not productivity, not quality. An API security firm tracked a tenfold increase in monthly findings within Fortune 50 companies between December 2024 and June 2025: from one thousand to more than ten thousand vulnerabilities per month. GitClear analyzed 211 million lines of code: duplicate code rose from 8.3% to 12.3% of all changes. Refactoring activity dropped from 25% to less than 10%.
A randomized controlled study with 52 engineers: participants using AI completed the task in similar time but scored 17 percentage points lower on a subsequent comprehension quiz. Fifty percent versus 67%.
The problem isn't adoption. It's the adoption metric. The percentage of AI-generated code measures mandate strength, not engineering health.
The counterpoint exists. And it's instructive precisely because the workflow was designed, not imposed.
Paulo Arruda, staff engineer at Shopify, presented at QCon AI how he replaced a monolithic prompt with a swarm of specialized Claude Code agents. The theme review process dropped from 22 hours to between 7 and 20 minutes. Evaluation of internal candidate moves: from hours to less than an hour. A swarm of 15 parallel Claude Code instances for internal documentation mining. The Claude Swarm project has over 1,400 stars on GitHub.
The Shopify lesson is about decomposition. Each review criterion became an agent with a single responsibility. Give a single prompt multiple responsibilities and the LLM gets lost.
The largest gains—65 to 190 times—appear when the bottleneck was human throughput, not LLM latency. Teams expecting that multiple in every case will be disappointed in most of them. The difference between Shopify and forced mandates isn't the tool. It's the design.
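That range follows directly from the theme review numbers: 22 hours before, 7 to 20 minutes after. A quick check of the arithmetic, with speedup as a hypothetical helper name:

```python
def speedup(before_minutes, after_minutes):
    """How many times faster the new process is than the old one."""
    return before_minutes / after_minutes


before = 22 * 60                   # 22 hours expressed in minutes
slow_case = speedup(before, 20)    # slowest reported run: 20 minutes
fast_case = speedup(before, 7)     # fastest reported run: 7 minutes

if __name__ == "__main__":
    print(f"{slow_case:.0f}x to {fast_case:.0f}x")
```

The multiple is that large because the baseline was a person working through a queue, not a model waiting on inference.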
The throughline this week is one: an agent in production is a system. Harness, memory, latency, guardrails, and the metric that defines success for whoever signed the mandate. Each of those layers can be the root cause. None appears in the benchmark.
Friday, in Edition, we open the trust budget: how much can an agent fail before it is pulled from production, and who signs off. Until then, have a good week.