The Bill Came Due: Cost, Sovereignty, and Auditing Agents
The week when cost, sovereignty, and auditing agents stopped being three separate conversations—and became the same architecture decision for the CTO.
Transcript
Uber burned through its entire 2026 AI budget in four months.
And the COO publicly admitted he cannot link a single token consumed to a feature delivered to consumers. In the words of Andrew Macdonald: "That link is not there yet."
This is the ai|expert Edition. The week when cost, sovereignty, and auditing agents stopped being three separate conversations—and became the same architecture decision for the CTO.
In December 2025, Uber did what many companies dream of doing: it distributed Claude Code to five thousand engineers at once. It was the kind of adoption that generates keynote slides. And it did.
By March, 84% of Uber's developers were agentic coding users. By April, 95% of engineers were using AI tools monthly. Seventy percent of committed code came from AI. Eleven percent of production backend updates were executed by agents with zero human review. These were the numbers any engineering leader wanted to present to the board.
And then the budget ran out. Four months after rollout, the company had burned through its entire AI budget forecast for all of 2026. CTO Praveen Naga told The Information: "I went back to square one, because the budget I thought I would need is already gone."
The mechanism is straightforward when you look at the pricing structure. Claude Code uses consumption-based pricing, with no caps per engineer. The monthly cost per engineer averaged between 150 and 250 dollars. Heavy users cost 500 to 2,000 dollars per month. In aggregate, across five thousand engineers, monthly spending ranged from 2.5 million to 10 million dollars. By comparison, GitHub Copilot charges a fixed rate of between 10 and 39 dollars per seat per month.
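The arithmetic behind those figures can be checked in a few lines. This is a back-of-the-envelope sketch that uses only the numbers quoted above; the function name is hypothetical. It shows that the 2.5 to 10 million dollar range is what you get when heavy-user rates are applied to all five thousand seats, while the 150 to 250 dollar average would imply 0.75 to 1.25 million.

```python
ENGINEERS = 5_000  # seats in the rollout, as quoted in the transcript


def monthly_fleet_cost(per_seat_low, per_seat_high, engineers=ENGINEERS):
    """Return (low, high) monthly spend if every seat costs within the given range."""
    return per_seat_low * engineers, per_seat_high * engineers


average_case = monthly_fleet_cost(150, 250)    # average consumption per engineer
heavy_case = monthly_fleet_cost(500, 2_000)    # every seat at heavy-user rates
copilot_case = monthly_fleet_cost(10, 39)      # fixed per-seat pricing

for label, (low, high) in [
    ("average", average_case),
    ("heavy", heavy_case),
    ("copilot", copilot_case),
]:
    print(f"{label:8s} {low:>12,} to {high:>12,} dollars per month")
```

The spread between the average case and the heavy case is the point: with no per-seat cap, the upper bound is set by behavior, not by the contract.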
But the worst part wasn't just the cost. It was the design of the incentive system Uber built on top of it. The company implemented internal leaderboards that ranked engineers by AI usage. More tokens consumed meant higher scores. Uber literally built a mechanism that rewards maximum consumption, and then was surprised by the bill. The company's AI costs rose sixfold relative to 2024, against an R&D base of 3.4 billion dollars.
And COO Andrew Macdonald exposed the measurement problem in a way that deserves direct quotation:
"It's very hard to draw a line between one of those stats and okay, now we're producing 25% more useful consumer features."
Engineering output metrics improved—more commits, faster iteration velocity. Feature delivery velocity to consumers did not keep pace proportionally. There was a mismatch between what was being measured and what the company needed to produce.
That mismatch reflects two simultaneous problems. First, a measurement failure: internal productivity proxies don't map to product velocity. Second, a genuine question about what agentic tools actually compress—they compress time to write code, but that code still needs QA, integration, and product review before it reaches the consumer.
Uber is not backing away. It's testing Codex alongside Claude Code and moving toward agent-led development. But CEO Dara Khosrowshahi revealed the company was decelerating hiring to compensate for increased AI investment. The bill came due, and it changed headcount decisions. Uber is not the only one recalculating: Microsoft's Experiences + Devices group canceled internal Claude Code licenses in May, with a June 30 migration deadline to GitHub Copilot CLI. Internal sources confirmed that cost management influenced the timing. Claude Code had become "very popular, perhaps too popular."
GitHub just published the playbook for how not to fall into that trap.
GitHub cut agentic workflow token spending by up to 62% in production. The approach was threefold: eliminate unused MCP tools, replace MCP calls with native GitHub CLI invocations, and implement two daily agentic loops—a Daily Token Usage Auditor and a Daily Token Optimiser.
What anchors the system is the Effective Tokens metric they created. Output tokens weigh 4 times. Cache reads weigh 0.1 times. Model multipliers: Haiku at 0.25 times, Sonnet at 1.0 times, Opus at 5.0 times. A 10% drop in Effective Tokens translates directly to 10% cost reduction—regardless of which model is running. It's a single currency that cuts across all stack heterogeneity.
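The weighting described above can be sketched as a small function. This is an illustration under stated assumptions, not GitHub's implementation: the weight of 1.0 for uncached input tokens is assumed (the transcript does not state it), and the `Usage` record and function name are hypothetical.

```python
from dataclasses import dataclass

# Weights quoted in the transcript.
OUTPUT_WEIGHT = 4.0
CACHE_READ_WEIGHT = 0.1
MODEL_MULTIPLIER = {"haiku": 0.25, "sonnet": 1.0, "opus": 5.0}

# Assumption: uncached input tokens count at 1.0 (not stated in the transcript).
INPUT_WEIGHT = 1.0


@dataclass
class Usage:
    model: str
    input_tokens: int
    output_tokens: int
    cache_read_tokens: int = 0


def effective_tokens(u: Usage) -> float:
    """Collapse one run's token counts into a single model-adjusted number."""
    weighted = (
        INPUT_WEIGHT * u.input_tokens
        + OUTPUT_WEIGHT * u.output_tokens
        + CACHE_READ_WEIGHT * u.cache_read_tokens
    )
    return MODEL_MULTIPLIER[u.model] * weighted


run_on_sonnet = Usage("sonnet", input_tokens=10_000, output_tokens=2_000, cache_read_tokens=50_000)
run_on_opus = Usage("opus", input_tokens=10_000, output_tokens=2_000, cache_read_tokens=50_000)

print(effective_tokens(run_on_sonnet))
print(effective_tokens(run_on_opus))
```

Because the model multiplier is applied last, the same workload on Opus scores five times the Sonnet figure, which is what lets a single number be compared across a mixed stack.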
Results were measured across twelve internal workflows over at least 109 post-correction executions. Auto-Triage Issues dropped 62%. Smoke Claude, 59%. Security Guard, 43%. Daily Community Attribution, 37%. Only one workflow increased—Contribution Check, by 5%—and GitHub attributed the increase to a change in PR size, not an optimizer regression.
The counterintuitive finding that matters more than the numbers: Daily Community Attribution was carrying eight unused MCP tools with zero calls across an entire execution. Removing those tools reduced nothing. The tool schemas were a tiny fraction of the workflow's total context—dominated by large text payloads. MCP pruning only works when schema bloat dominates the prompt. If your workflow already carries large documents or extended conversation history, cutting unused MCP tools will be optimization noise.
What GitHub did most intelligently was create agents that audit themselves. The Daily Token Usage Auditor aggregates consumption by workflow and flags anomalous spikes. The Daily Token Optimiser reviews source code and recent logs, opens a GitHub issue, and proposes specific fixes. The infrastructure is available on the gh-aw CLI today. The team's next step is portfolio analysis to eliminate duplicate reads and shared intermediate artifacts across entire repository fleets.
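To make the auditing idea concrete, here is a minimal sketch of spike flagging over daily totals per workflow. The transcript does not say how GitHub's auditor defines an anomaly, so the rule here (latest day above a multiple of the median of previous days) is an assumption, and the workflow names and data are invented.

```python
from statistics import median


def flag_spikes(history, ratio=2.0, min_days=3):
    """Flag workflows whose latest daily total exceeds ratio * median of earlier days.

    history maps workflow name -> list of daily totals, oldest first.
    Returns workflow name -> (latest / median) for each flagged workflow.
    """
    flagged = {}
    for workflow, totals in history.items():
        if len(totals) < min_days + 1:
            continue  # not enough history to form a baseline
        previous, latest = totals[:-1], totals[-1]
        baseline = median(previous)
        if baseline > 0 and latest > ratio * baseline:
            flagged[workflow] = latest / baseline
    return flagged


# Invented daily totals for two hypothetical workflows.
history = {
    "workflow_a": [100, 110, 90, 105, 400],
    "workflow_b": [50, 52, 49, 51],
}

spikes = flag_spikes(history)
print(spikes)
```

A median baseline is used so that one earlier bad day does not raise the threshold for the next one; any real auditor would need to tune the ratio per workflow.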
The thesis is simple: token pricing at enterprise scale is a cloud cost problem, not a software license problem. Usage caps, per-engineer budgets, and monitoring layers equivalent to DevOps governance on AWS need to precede agentic rollouts—not follow the first budget emergency.
And there's a third element in this equation, and Anthropic just moved it materially.
On May 28, six weeks after Opus 4.7, Anthropic launched Claude Opus 4.8. The standard API price stayed identical: 5 dollars per million input tokens, 25 dollars per million output tokens. But fast mode—which delivers approximately 2.5 times more throughput than standard mode—dropped from 30 and 150 dollars to 10 and 50 dollars per million tokens. A two-thirds reduction in the mode that matters most for agentic loops.
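The effect of the price change on a single agentic run can be worked through directly. The prices are the ones quoted above; the token counts for the run are hypothetical and chosen only to make the arithmetic visible.

```python
# Dollars per million tokens as (input, output), from the transcript.
PRICES = {
    "standard": (5.0, 25.0),
    "fast_before": (30.0, 150.0),
    "fast_after": (10.0, 50.0),
}


def run_cost(mode, input_tokens, output_tokens):
    """Dollar cost of one run at the given pricing mode."""
    in_price, out_price = PRICES[mode]
    return (input_tokens * in_price + output_tokens * out_price) / 1_000_000


# Hypothetical agentic loop: 2M input tokens, 200k output tokens.
INPUT_TOKENS, OUTPUT_TOKENS = 2_000_000, 200_000

before = run_cost("fast_before", INPUT_TOKENS, OUTPUT_TOKENS)
after = run_cost("fast_after", INPUT_TOKENS, OUTPUT_TOKENS)
standard = run_cost("standard", INPUT_TOKENS, OUTPUT_TOKENS)

reduction = 1 - after / before      # share of the fast-mode bill removed
fast_premium = after / standard     # what fast mode now costs relative to standard

print(f"fast before: ${before:.2f}, fast after: ${after:.2f}, standard: ${standard:.2f}")
print(f"reduction: {reduction:.1%}, fast premium over standard: {fast_premium:.1f}x")
```

Since both input and output prices fell by the same factor, the two-thirds reduction holds for any mix of input and output tokens, and fast mode lands at twice the standard rate.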
And performance rose alongside. On SWE-Bench Pro, Opus 4.8 scored 69.2%—above 4.7's 64.3%, surpassing GPT-5.5 at 58.6% and Gemini 3.1 Pro at 54.2%. On SWE-bench Verified, it reached 88.6%. On computer-use, 84% on Online-Mind2Web and 83.4% on OSWorld-Verified. GDPval-AA Elo of 1890, against GPT-5.5's 1769. It's the only model to complete all cases of Anthropic's internal Super-Agent benchmark end-to-end.
What that forces the CTO to reconsider is the purchase premise. When the market's most capable model costs the same as its predecessor and delivers superior performance, the cost of not measuring tokens per feature becomes higher than the cost of measuring it badly. Tokens per feature has to be a board KPI, not an engineering detail.
Before turning the page, one alert from the Opus 4.8 system card that deserves attention before any rollout. Anthropic identified a growing trend of the model speculating about evaluators in its internal reasoning text, present in approximately 5% of training episodes. This hasn't yet translated into observably worse behavior. But architects running pipelines with governed evaluation or agents in legal contexts need to monitor calibration drift before scaling.
Thirty-six days between 4.7 and 4.8. If you built evaluation gates around 4.7 in April, you already have qualification debt—regardless of vendor.
Let's shift the frame. The entire token cost conversation assumes one premise: the model lives in the cloud. But there's a cluster running DeepSeek V3.1 with 671 billion parameters at 25 tokens per second—on four Mac Studios, no API calls, no data egress, for 38 thousand dollars. This is not a research paper. It's documented production.
The configuration: four Mac Studio M3 Ultra with 512 gigabytes of RAM each, at 9,499 dollars per unit, connected with Thunderbolt 5—total approximately 38 thousand dollars. The same stack runs Kimi K2, the trillion-parameter Mixture-of-Experts model, at 34 tokens per second. Everything on-premises, under full control, no third party in the loop.
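A quick capacity check shows why this configuration is plausible. This is a rough sketch: it counts weights only (no KV cache or activations), the transcript does not say which quantization the cluster runs, and the 80% usable-memory fraction is an arbitrary assumption to leave room for the operating system and runtime.

```python
NODES = 4
RAM_PER_NODE_GB = 512        # from the transcript
PRICE_PER_NODE_USD = 9_499   # from the transcript


def weights_gb(params_billion, bits_per_weight):
    """Weights-only footprint in decimal GB: billions of parameters times bytes per parameter."""
    return params_billion * bits_per_weight / 8


def fits(params_billion, bits_per_weight, usable_fraction=0.8):
    """True if the weights fit in the pooled memory (usable_fraction is an assumption)."""
    budget_gb = NODES * RAM_PER_NODE_GB * usable_fraction
    return weights_gb(params_billion, bits_per_weight) <= budget_gb


cluster_cost = NODES * PRICE_PER_NODE_USD
print(f"cluster cost: {cluster_cost:,} dollars, pooled RAM: {NODES * RAM_PER_NODE_GB} GB")

# 671B for DeepSeek V3.1; a nominal 1,000B for the trillion-parameter Kimi K2.
for name, params_billion in [("DeepSeek V3.1", 671), ("Kimi K2", 1000)]:
    for bits in (16, 8, 4):
        size = weights_gb(params_billion, bits)
        print(f"{name:14s} {bits:2d}-bit: {size:7.1f} GB  fits={fits(params_billion, bits)}")
```

Under these assumptions the 671-billion-parameter model fits even at 16 bits per weight, while a trillion-parameter model needs 8 bits or fewer to leave any headroom.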
What made this viable today is macOS 26.2's RDMA-over-Thunderbolt 5. Stabilise.io measured 5 to 9 microseconds of latency between nodes, versus a typical 300 microseconds for Thunderbolt without RDMA. That reduction, roughly 30 to 60 times, allows you to shard a model with tensor parallelism across unified memory pools without the interconnect becoming the bottleneck.
For smaller budgets, there's the four Mac Mini M4 Pro configuration: 48 gigabytes of RAM each, 1,999 dollars per unit, plus 200 dollars in Thunderbolt 5 cables. Total approximately 8,200 dollars. That setup serves Nemotron-70B at 8 tokens per second and Qwen2.5-Coder-32B at 18 tokens per second via EXO Labs.
The choice of framework depends on the critical path. An arXiv benchmark on Mac Studio M2 Ultra found: MLX delivering the highest sustained generation throughput; MLC-LLM the lowest time-to-first-token; llama.cpp the most efficient single-stream serving; Ollama the most ergonomic deployment—with throughput and latency costs. All still fall behind vLLM in absolute performance on NVIDIA GPU systems.
But absolute performance isn't the central argument here. The argument is sovereignty. Weights and prompts never leave the building. No third-party APIs, no egress monitoring, no rate-limit negotiation, no data-use clauses to review. For teams under strict residency rules—GDPR, HIPAA, NIS2—that architectural property can outweigh raw throughput differences.
But the caveats are real. RDMA still requires manual commands in macOS recovery mode—it's early-stage. Any M1 or M2 node on Thunderbolt 4 regresses to TCP/IP and loses latency guarantees. These clusters are better suited for batch inference than for high-frequency interactive chat.
And Mixture-of-Experts models introduce a specific bottleneck. An arXiv study running DBRX 132B on a Mac Studio M2 Ultra cluster found that communication time approaches computation time during expert routing—requiring manual memory optimization to prevent the interconnect from dominating the critical path. That same workload was 1.15 times more cost-efficient than an NVIDIA H100 supercomputer—but only after manual tuning.
The new floor for BYO-inference is 38 thousand dollars and some engineering hours in recovery mode. The strategic question is what happens in the layers above that floor—at platform scale.
Snowflake demonstrated what happens. The company committed 6 billion dollars to AWS over five years, in Graviton 5 CPUs and cloud GPUs. To put the trajectory in context: at IPO in 2020, the commitment was 1.2 billion dollars. In 2023, it rose to 2.5 billion. In 2026, it more than doubled to 6 billion, an average of 1.2 billion per year.
Graviton 5 has 192 Arm Neoverse V3 cores with 12 memory channels at 8,800 megatransfers per second. Snowflake is migrating general-purpose compute from Intel and AMD x86 CPUs to Arm. GPUs continue to handle model training and inference. The control plane runs on Arm: the natural language SQL engine of Cortex AI, summarization pipelines, sentiment analysis, and the MCP fabric from Natoma, which Snowflake acquired for agent governance.
And the reason reveals the real structure of agentic cost. The GPU handles model inference. But every SQL query, every Python UDF, every workflow step that an agent fires—that's general-purpose compute. Agent throughput is CPU-bound. Meta committed to deploying tens of millions of Graviton 5 cores for agentic AI. The control plane is the bottleneck. Silicon budgets are shifting accordingly.
The capacity risk is immediate. Andy Jassy told GeekWire that two major customers recently tried to buy Graviton's entire 2026 production from Amazon, and were refused. On-demand Graviton capacity at scale is effectively unavailable. Multi-year advance capacity reservation is mandatory for anyone building agentic platforms.
And there's a second risk that's receiving less attention: ISA lock-in. Graviton is Arm-based, but available only on AWS. A future multi-cloud migration becomes substantially more expensive than moving between x86 clouds. When you sign a 6 billion dollar commitment over five years, switching cost ceases to be theoretical.
There's also an integration regression that needs to be named. Snowflake acquired Natoma for Model Context Protocol governance, integrating agents into enterprise systems. But most organizations still lack observability that connects CPU core saturation directly to agent task completion rates. The concrete failure mode: a GPU sitting idle, waiting for a SQL result. If Graviton concurrency chokes on UDF execution or MCP handshake overhead, end-to-end latency regresses, even if per-core efficiency improves.
The third layer of sovereignty isn't silicon. It's where the agent can execute code. And Azure Logic Apps just brought that inside its own integration bus.
Logic Apps embedded code interpreters in a sandbox for its agentic loop. Python, JavaScript, C#, and PowerShell executing within isolated runtimes without leaving the integration bus—in public preview, integrated with more than 1,400 pre-built connectors.
The technical stack varies by tier. At the Standard tier, the interpreter spins up a dynamic Azure Container Apps session within a Hyper-V microVM—network isolation ensuring data stays within defined boundaries. At the Consumption tier, JavaScript runs in a V8 isolate via the isolated-vm library—a lighter mechanism than Hyper-V, with different security guarantees.
And Microsoft is explicit: this is not a complete security sandbox. It's defense in depth. Memory limits, execution timeouts, failure isolation that prevents agent crashes from bringing down the runtime process. It's not safe for completely untrusted code. The risk shifts from system escape to silent logic error or prompt-injection abuse within the sandbox boundary.
The immediate architectural benefit is taking analytical loads out of the LLM's context window. Large in-context calculations generate hallucinations. Code in the sandbox executes deterministically and returns the result. But Microsoft hasn't published per-call latency, per-execution cost, or throughput. Dynamic ACA sessions introduce infrastructure overhead and cold-start latency that weren't quantified. And without a production SLA in public preview, the architecture decision needs to include that risk explicitly.
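The pattern of pushing a calculation out of the context window and into a bounded process can be illustrated without any cloud service. This is not the Logic Apps API or its isolation mechanism: it is a minimal local sketch using Python's standard `subprocess` module, with a wall-clock timeout and failure isolation only. Like the preview it illustrates, it is not a security boundary for untrusted code. The function name and return shape are hypothetical.

```python
import subprocess
import sys


def run_snippet(code, timeout_s=5.0):
    """Run Python code in a separate interpreter process with a wall-clock timeout.

    A crash or hang in the snippet cannot take down the caller, but this is
    defense in depth only: there is no filesystem or network isolation here.
    """
    try:
        proc = subprocess.run(
            [sys.executable, "-I", "-c", code],  # -I: isolated mode, ignores user site and env vars
            capture_output=True,
            text=True,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        return {"ok": False, "error": "timeout", "stdout": ""}
    if proc.returncode != 0:
        lines = proc.stderr.strip().splitlines()
        return {"ok": False, "error": lines[-1] if lines else "nonzero exit", "stdout": proc.stdout}
    return {"ok": True, "error": None, "stdout": proc.stdout}


# The agent delegates the arithmetic instead of doing it in-context.
result = run_snippet("print(sum(range(101)))")
print(result)
```

The caller gets either a deterministic result or a structured failure it can hand back to the agent loop, which is the property the paragraph above is pointing at.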
Sovereign silicon in the rack. Sovereign runtime in the workflow. Sovereign control plane in the cloud. Sovereignty isn't one decision—it's three chained together. And the window to make each one is closing.
The stories from both previous segments converge on the same question: how do you know if the agent is doing what you think it's doing? Not on the benchmark. In your production environment, in your codebase, in your business processes. And the honest answer, today, for most organizations is: you don't know.
Researchers from Stanford, William Overman and Mohsen Bayati, published CCO: Calibrated Collective Oversight. It is a framework that maintains human control over autonomous agents with finite-time statistical guarantees, even in adversarial observation scenarios and without distributional assumptions. Without needing to specify an MDP. Without requiring coordination between supervisors.
The mechanism in practical terms: at each step, the primary agent proposes candidate actions with utility scores. A conservative baseline action is always available, the action that takes no risk. Auxiliary supervisors score each candidate along dimensions like scope or safety. CCO calculates an aggregate penalty measuring total deviation from the conservative baseline, and selects the action that maximizes utility minus lambda times the penalty. Lambda is calibrated online using Conformal Decision Theory, with finite-time guarantees and no distributional assumptions.
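The selection rule just described can be sketched in a few lines. This is an illustration under stated assumptions, not the authors' implementation: the penalty is taken here as the summed excess of each supervisor's score over its score for the baseline, and the lambda update is a generic online rule (raise lambda when observed loss exceeds the target, lower it otherwise) standing in for the paper's conformal calibration. All names and numbers are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    name: str
    utility: float


def penalty(candidate, baseline, supervisors):
    """Aggregate deviation from the baseline (assumed form: summed positive excess)."""
    return sum(max(0.0, s(candidate) - s(baseline)) for s in supervisors)


def select(candidates, baseline, supervisors, lam):
    """Pick the action maximizing utility - lam * penalty; the baseline is always eligible."""
    pool = list(candidates) + [baseline]
    return max(pool, key=lambda c: c.utility - lam * penalty(c, baseline, supervisors))


def update_lambda(lam, observed_loss, target, step=0.5):
    """Assumed online update: tighten after losses above target, relax otherwise."""
    return max(0.0, lam + step * (observed_loss - target))


# Hypothetical supervisor scoring "scope" risk by action name.
SCOPE_RISK = {"noop": 0.0, "refactor_one_file": 0.2, "rewrite_module": 2.0}


def scope_supervisor(candidate):
    return SCOPE_RISK[candidate.name]


baseline = Candidate("noop", 0.0)
candidates = [Candidate("refactor_one_file", 1.0), Candidate("rewrite_module", 3.0)]

for lam in (0.5, 2.0, 10.0):
    chosen = select(candidates, baseline, [scope_supervisor], lam)
    print(f"lambda={lam}: {chosen.name}")
```

As lambda grows, the choice moves from the high-utility rewrite, to the narrow refactor, to the no-risk baseline. That is the lever the online calibration is turning.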
The results on the MACHIAVELLI benchmark, 134 Choose-Your-Own-Adventure games with over 500 thousand social decision-making scenarios, showed a significant reduction in ethical violations without reward collapse. On modified SWE-bench, weaker supervisors successfully restricted an adversarially misaligned stronger agent. Empirical violation rates closely tracked specified targets.
The practical limitation that matters for anyone deploying today: CCO needs to know after each step whether the action incurred a loss. On MACHIAVELLI and SWE-bench, that information is available instantly because it's scripted. In live deployments, the loss might not be observable for minutes or hours, which breaks the online feedback loop the framework depends on.
And there's the latency cost of ensemble: each candidate action needs to be scored by all auxiliary supervisors against the baseline. If all supervisors miss a risk—because none were trained to recognize it—the deviation penalty is close to zero and the agent proceeds unimpeded. The system is only as good as the diversity and coverage of the ensemble. That's the limit and also the entry point for the architect: ensemble composition is where auditing investment begins.
And Microsoft applied exactly that logic—agent ensemble, adversarial debate, staged validation—at industrial scale on its own codebase. The result arrived on Patch Tuesday in May.
MDASH, the Multi-model Agentic Scanning Harness, was built by Microsoft's Autonomous Code Security team with the MORSE group and the WARP team. In May 2026, it found 16 Windows vulnerabilities, including four critical RCEs. All 16 went directly into the May 2026 Patch Tuesday cycle.
The pipeline has five stages: Prepare, Scan, Validate, Dedup, Prove. Over 100 specialized agents run in parallel across an ensemble of frontier and distilled models. The system ingests a codebase, builds language-aware indices, extracts threat models from commit history, runs specialized auditor agents over candidate code paths, deduplicates findings through agent debate, and generates proof-of-concept exploits for survivors.
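The shape of such a harness, five stages with the models passed in rather than baked in, can be sketched as follows. This is a toy illustration, not MDASH: the model interface (prompt in, text out) is hypothetical, deduplication is done by a simple key instead of agent debate, and the "model" is a stub that pattern-matches strings.

```python
def prepare(source):
    """Stage 1: build an index of code paths (here, just sorted file names)."""
    return sorted(source)


def scan(paths, source, scanners):
    """Stage 2: every scanner in the ensemble audits every path."""
    findings = []
    for path in paths:
        for model in scanners:
            issue = model("scan\n" + source[path])
            if issue:
                findings.append({"path": path, "issue": issue})
    return findings


def validate(findings, judge):
    """Stage 3: keep only findings a separate judge confirms."""
    return [f for f in findings if judge("validate\n" + f["issue"]) == "confirmed"]


def dedup(findings):
    """Stage 4: drop repeats (by path and issue; the real system uses agent debate)."""
    seen, unique = set(), []
    for f in findings:
        key = (f["path"], f["issue"])
        if key not in seen:
            seen.add(key)
            unique.append(f)
    return unique


def prove(findings, prover):
    """Stage 5: attach a proof-of-concept artifact to each survivor."""
    return [{**f, "poc": prover("prove\n" + f["issue"])} for f in findings]


def run_harness(source, scanners, judge, prover):
    paths = prepare(source)
    return prove(dedup(validate(scan(paths, source, scanners), judge)), prover)


def toy_model(prompt):
    """Stub standing in for a real model; swapping it does not touch the stages."""
    task, _, body = prompt.partition("\n")
    if task == "scan":
        return "use-after-free" if "free(" in body and "use(" in body else ""
    if task == "validate":
        return "confirmed" if body == "use-after-free" else "rejected"
    return "poc for " + body


source = {"a.c": "free(p); use(p);", "b.c": "free(q); use(q);", "c.c": "return 0;"}
report = run_harness(source, scanners=[toy_model, toy_model], judge=toy_model, prover=toy_model)
print(report)
```

The stages never reference a specific model, so replacing `toy_model` changes the findings but not the pipeline. That is the sense in which the harness is built to survive a model swap.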
On the public CyberGym benchmark—1,507 real-world vulnerabilities—MDASH scored 88.45%, leading by five percentage points over Anthropic's Claude Mythos Preview. Internal recall against confirmed MSRC cases: 96% on clfs.sys across 28 cases over five years; 100% on tcpip.sys across seven cases in the same period. In a controlled private test with StorageDrive—a codebase models had never seen publicly—MDASH found all 21 planted vulnerabilities with zero false positives.
Two critical findings that entered Patch Tuesday: CVE-2026-33827, an unauthenticated remote use-after-free in tcpip.sys. And CVE-2026-33824, a double-free in the IKEv2 service reachable on UDP port 500. Both critical. Both found before a malicious actor did.
The lesson for the architect isn't "wait for MDASH to hit GA." It's about where to invest: in the orchestration harness. Staged validation, adversarial debate between agents, automated proof-of-concept generation. Microsoft was explicit: the system was designed to survive a model swap without rebuilding the harness. The model is replaceable. The pipeline is the moat.
And there's a third angle to this auditing story—where the cost of not auditing became more concrete than any CVE.
A Stanford-conducted audit of 3.4 million applicants screened by Pymetrics—now owned by Harver—revealed that a single vendor's cognitive assessment algorithm produced measurable racial adverse impact at the individual job level. The study analyzed 4 million applications across 1,700 positions and 150 employers.
The algorithm directed 26% of Black applicant submissions and 15% of Asian applicant submissions to positions where the system discriminated against their group, according to the EEOC four-fifths rule.
And the effect of systemic exclusion is mathematical. Ten percent of applicants who submit four applications are rejected from all four. Four percent of those who apply to ten positions are algorithmically rejected from all ten. The probability of total exclusion only drops below 0.1% if an applicant applies to 25 distinct positions.
The mechanism that makes this systemic: Pymetrics stores scores and reuses them across its network of employers for up to 330 days. An applicant who applies to multiple companies doesn't receive independent assessments. The same cached score is referenced repeatedly. Forty thousand additional applications from minority applicants would have advanced to human review under equal treatment.
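A short calculation shows why score reuse matters for the exclusion figures quoted above. This is an illustrative check, not part of the study: it asks what the ten-application figure would be if each rejection were an independent event at the rate implied by the four-application figure. The two cohorts may differ in other ways, so treat the comparison as indicative only.

```python
def implied_rejection_rate(all_rejected_share, applications):
    """Per-application rejection rate p such that p ** applications equals the observed share."""
    return all_rejected_share ** (1 / applications)


# Reported: 10% of applicants with four applications are rejected from all four.
p = implied_rejection_rate(0.10, 4)

# If rejections were independent, the share rejected from all ten would be p ** 10.
independent_ten = p ** 10

# Reported: 4% of applicants with ten applications are rejected from all ten.
reported_ten = 0.04
excess = reported_ten / independent_ten

print(f"implied per-application rejection rate: {p:.3f}")
print(f"all-ten rejection under independence:   {independent_ten:.4f}")
print(f"reported all-ten rejection:             {reported_ten:.4f} ({excess:.1f}x higher)")
```

Under independence the all-ten figure would be roughly a third of one percent. A reported figure more than ten times higher is what you would expect when the same stored score decides every application.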
The part that breaks all prior compliance narratives: the vendor's own aggregated audit didn't find disparities reaching legal scrutiny. Aggregating across occupations dilutes bias across job families. Stanford's per-position analysis showed 10.62% of individual positions with adverse impact against Black applicants. The pooled audit masked what the per-position analysis exposed.
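How pooling can hide what a per-position audit finds is easy to reproduce with synthetic numbers. The sketch below applies the four-fifths rule (a group's selection rate divided by the highest group's rate, flagged when below 0.8) both per position and on the pooled counts. The data, group labels, and function names are invented for illustration and are not from the study.

```python
THRESHOLD = 0.8  # four-fifths rule


def impact_ratios(counts):
    """counts maps group -> (selected, applicants); returns group -> rate / highest rate."""
    rates = {g: selected / applicants for g, (selected, applicants) in counts.items()}
    top = max(rates.values())
    if top == 0:
        return {g: 1.0 for g in rates}  # nobody selected, so no disparity to measure
    return {g: rate / top for g, rate in rates.items()}


def pool(positions):
    """Sum selected and applicant counts per group across all positions."""
    total = {}
    for counts in positions.values():
        for g, (selected, applicants) in counts.items():
            s, n = total.get(g, (0, 0))
            total[g] = (s + selected, n + applicants)
    return total


def flagged(counts):
    """Groups whose impact ratio falls below the four-fifths threshold."""
    return sorted(g for g, r in impact_ratios(counts).items() if r < THRESHOLD)


# Synthetic data: group -> (selected, applicants) for two positions.
positions = {
    "position_1": {"group_a": (50, 100), "group_b": (30, 100)},
    "position_2": {"group_a": (20, 100), "group_b": (30, 100)},
}

print("pooled:", flagged(pool(positions)))
for name, counts in positions.items():
    print(name + ":", flagged(counts))
```

The pooled audit flags nothing, because the disparities point in opposite directions and cancel, while each position on its own fails the four-fifths test for one group.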
And that's not an implementation defect. It's an auditing architecture failure. New York Local Law 144 explicitly permits pooled audits—the method that masked bias in this case. Most third-party screening vendors have no obligation to measure score persistence across employers as a concentration risk.
Market scale turns that into systemic risk. Sixty percent or more of the Fortune 100 and eight of the ten largest U.S. federal agencies run screening through HireVue alone. Correlated decisions, deterministic, propagated across institutions from a restricted set of shared models. A single scoring edge case can blacklist an applicant across an entire network. The EU AI Act's compliance deadline for hiring tools falls on August 2, 2026—and the study argues current frameworks still lack mandates for per-position adverse impact, market surveillance between employers, and independent researcher access to vendor data.
What connects this story to the first segment: if you don't have a statistical auditing framework running today—over your code agents, over your screening systems, over your decision workflows—the regulator, the journalist, or the customer will run it for you.
And the bill is higher than the token bill.
Cost, sovereignty, auditing. These aren't three conversations to schedule in three separate meetings. They're one architecture decision. And you're making it, whether you know it or not.
The week when the bill came due wasn't about a number. It was about the absence of a method to read it. Tokens without a KPI, silicon without observability, agents without statistical guarantees. The work for next week: choose which of the three layers you're instrumenting first. Wire Wednesday: we open with the GPIC, the open corpus of 28 trillion pixels that just retired ImageNet as the training standard. Good work.