aiexpert
EDITION Ep. 11 · May 22, 2026 · 11:52

The week memory became the bottleneck and agents went to production

The week AI cost migrated from GPU to memory, agents faced real enterprise systems, and research delivered honest benchmarks to measure what's actually in production.

Hosts: Alan · Ada

Transcript

JOHN

Seven point eight million dollars. That's the cost of a Vera Rubin rack. And two million of that is memory alone.

MARIA

Four hundred and thirty-five percent increase compared to the prior generation. Memory cost now grows faster than GPU cost — and the cost bottleneck in AI has changed address.

JOHN

This is the ai|expert Edition. The week infrastructure cost migrated from GPU to memory, agents entered real enterprise systems, and research delivered the benchmarks we needed to measure what's actually in production. Let's break down the arithmetic. A Vera Rubin VR200 NVL72 rack will cost, according to Morgan Stanley estimates, seven point eight million dollars to hyperscalers. GPU dominates in volume: seventy-two Rubin GPUs, at fifty-five thousand dollars each in volume orders, sum to three point ninety-six million dollars — the largest single line item in the bill of materials. But memory reached two million dollars. Twenty-five percent of the total system cost going to memory is a proportion that didn't exist in any prior generation of Nvidia accelerator.

MARIA

And what matters isn't just where the number sits today; it's the trajectory. On the GB300 NVL72 from the prior generation, memory cost a fraction of that. On Vera Rubin, it grew four hundred and thirty-five percent. GPU also got more expensive, but not at that rate. Memory cost now grows faster than compute cost. The culprit is three simultaneous layers, not one. The VR200 rack carries fifty-four terabytes of LPDDR5X, roughly three times the seventeen terabytes of the GB200. SemiAnalysis estimates Nvidia paid eight dollars per gigabyte of LPDDR5X in the first quarter of 2026. At that price, just the LPDDR5X runs four hundred and thirty-two thousand dollars per rack. If the price rises to ten dollars, which is what the upstream market is signaling, we reach five hundred and forty thousand. Per rack.
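The arithmetic in this segment can be checked with a short sketch. It uses only the figures quoted in the episode; the one assumption is that terabytes are decimal (1 TB = 1,000 GB), which is the reading under which the ten-dollar case comes to five hundred and forty thousand dollars.

```python
# Figures quoted in the episode; the TB -> GB conversion is an assumption.
RACK_COST_USD = 7_800_000   # Morgan Stanley estimate for a VR200 NVL72 rack
GPU_COUNT = 72
GPU_UNIT_USD = 55_000
MEMORY_USD = 2_000_000
LPDDR5X_TB = 54
GB_PER_TB = 1_000           # assumption: decimal terabytes


def lpddr5x_cost(price_per_gb: float) -> float:
    """LPDDR5X cost per rack at a given price per gigabyte."""
    return LPDDR5X_TB * GB_PER_TB * price_per_gb


gpu_total = GPU_COUNT * GPU_UNIT_USD
memory_share = MEMORY_USD / RACK_COST_USD

print(f"GPU line item:  ${gpu_total:,}")
print(f"Memory share:   {memory_share:.1%}")
print(f"LPDDR5X at $8:  ${lpddr5x_cost(8):,.0f}")
print(f"LPDDR5X at $10: ${lpddr5x_cost(10):,.0f}")
```

Swapping in your own per-gigabyte quote for the argument of `lpddr5x_cost` gives a quick sensitivity check for a cluster budget.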

JOHN

And there's a third line in the BOM that simply didn't exist before. 3D NAND storage: approximately one million dollars per rack. On the GB200 NVL72, that number was virtually zero. An entirely new category in system cost. When you sum LPDDR5X, NAND, and the on-package HBM4 in the Rubin GPUs themselves, memory in all its forms dominates the cost curve of the cluster. This isn't commodity fluctuation; it's a structural change in BOM composition.

MARIA

And the upstream market confirms the pressure isn't temporary. Taiwanese memory module makers — Adata, TeamGroup, Innodisk, GoldKey, Apacer, Transcend, Silicon Power — collectively raised twenty-eight billion Taiwan dollars, roughly eight hundred and eighty million US dollars, through convertible debentures, syndicated loans, and private equity placements. Just to buy and stockpile chips while prices surge. Adata is the largest single borrower: two billion in debentures, twelve billion in syndicated bank loans, and a private equity placement of thirty million shares still pending.

JOHN

TrendForce estimated conventional DRAM contract prices rose between ninety and ninety-five percent in the first quarter of 2026, with an additional fifty-eight to sixty-three percent projected for the second quarter. NAND flash rose approximately sixty percent in the first quarter, with a projection of seventy to seventy-five percent additional in the second. In a revision published in early May, mobile DRAM pricing was adjusted upward: ninety-three to ninety-eight percent quarterly increase, as Samsung, Micron, and SK Hynix finalized negotiations with customers.
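Successive quarterly increases compound rather than add, which is easy to miss when the numbers are read aloud. The sketch below assumes the second-quarter projections apply on top of the first-quarter prices (the transcript calls them "additional") and expresses the result as a multiple of the price before the first-quarter increase.

```python
def compound(*quarterly_increases: float) -> float:
    """Cumulative price multiple after successive fractional increases."""
    multiple = 1.0
    for increase in quarterly_increases:
        multiple *= 1.0 + increase
    return multiple


# (Q1 actual, Q2 projected) ranges as quoted from TrendForce in the episode.
dram_low, dram_high = compound(0.90, 0.58), compound(0.95, 0.63)
nand_low, nand_high = compound(0.60, 0.70), compound(0.60, 0.75)

print(f"DRAM: {dram_low:.2f}x to {dram_high:.2f}x the pre-Q1 contract price")
print(f"NAND: {nand_low:.2f}x to {nand_high:.2f}x the pre-Q1 contract price")
```

Under that assumption, conventional DRAM ends the second quarter at roughly three times its pre-increase contract price, and NAND at roughly 2.7 to 2.8 times.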

MARIA

The most revealing behavioral detail this week: Adata's chairman, Simon Chen, confirmed that cloud service providers approached the company to lock in long-term supply agreements. He described it as a "rare occurrence." When a hyperscaler knocks on a memory module maker's door to contract directly, it's because allocation via fab isn't covering demand. The fabs prioritize server DRAM and HBM, which go straight into GPU clusters and AI accelerators, over consumer DRAM. New fab capacity shouldn't arrive in volume before the end of 2027. Module makers are taking on debt to maintain market position, not to expand capacity.

JOHN

The practical conclusion for architects is direct: if you're budgeting an inference cluster for the second half of 2026, don't use last year's memory prices as a reference. Model at least sixty percent higher. Memory-efficient kernel design, quantization strategies that minimize pressure on activation memory, and careful evaluation of NAND storage in inference pipelines have stopped being performance optimizations; they're direct capex decisions.

While hardware cost explodes, the geography of who makes that hardware is being redrawn. Jensen Huang confirmed on Nvidia's first-quarter earnings call, with revenue growing eighty-five percent year-over-year to eighty-one point sixty-two billion dollars, that the company "largely conceded" the Chinese AI accelerator market to Huawei. China, which already accounted for at least a fifth of Nvidia's data center revenue, was zeroed out in the company's internal projections.

MARIA

Huang's exact phrase: "We've really largely conceded that market to them." That's more revealing than the revenue number, because Nvidia grew eighty-five percent without China. The available market outside China is large enough to justify not trying to work around the controls. The Trump administration required licenses to export H100, H200, and related chips in April. Huang told investors to "expect nothing" in terms of approvals. Alibaba, Tencent, ByteDance, and JD.com received individual approvals for H200 — but a US commercial representative confirmed that chip export controls were excluded from May bilateral negotiations with China.

JOHN

Huawei responded with the Ascend 910C: dual-chiplet chip on SMIC's seven-nanometer DUV process, delivering up to eight hundred teraFLOPS in FP16, one hundred and twenty-eight gigabytes of HBM, three point two terabytes per second of memory bandwidth. Huawei projects production of six hundred thousand units in 2026 — nearly double 2025 output. At the rack level, the CloudMatrix 384 integrates three hundred and eighty-four Ascend 910C chips and delivers approximately three hundred petaFLOPS in BF16, versus one hundred and eighty of Nvidia's GB200 NVL72 in raw terms.

MARIA

With four times more power consumption and two point three times less efficiency per watt. Raw scale exists — more silicon, larger clusters — and it works for inference in production. But there's a piece of empirical evidence that matters more than any benchmark comparison: DeepSeek abandoned Ascend hardware during R2 training after finding stability and throughput failures at scale and reverted to H800s. When the Chinese lab investing most in hardware optimization can't make domestic hardware work for frontier training, the limitation is real and documented.

JOHN

Bernstein Research puts Nvidia's market share in China at eight percent in 2026, versus sixty-six percent in 2024. Huawei is at fifty percent. The accelerator market in China isn't shrinking — it's being served by another company, with another software architecture. Huang still wants back in: "We would be more than delighted to serve the market." But Nvidia's own guidance assumes the door stays closed.

MARIA

What this changes for everyone outside China: global demand for Nvidia is concentrated in the markets the company can still serve. That pressures pricing and delivery timelines globally. And there's a software question teams with China-facing workloads need to solve now: Huawei's CANN bridges to PyTorch via adaptation layers and works for standard Transformer workloads. But FP8 support on the Ascend 910C isn't confirmed. Inference pipelines built on FP8 quantization, the standard in production on H100 and later, regress to INT8 or FP16 on Ascend, with reduced effective throughput. If your stack depends on FP8, solve that engineering problem before committing to a platform.

JOHN

And this week private capital made a big bet on the next layer of the stack. Blackstone and Google announced a joint venture worth twenty-five billion dollars to build a TPU-dedicated cloud — the first third-party distribution channel for Google's silicon outside the standard GCP interface. Blackstone puts in five billion dollars in initial equity, with the rest of the value in structured debt against data center assets and equipment. Goal: five hundred megawatts of capacity online by 2027.

MARIA

The model follows the CoreWeave template applied to Google's silicon: third-party infrastructure running accelerators from a single vendor as compute-as-a-service. Blackstone brings data center depth — it's the largest global provider, owner of QTS Realty Trust since 2021. Google supplies TPUs, software, and services. Benjamin Treynor Sloss, who spent more than two decades building Google's global infrastructure, is leaving the company to lead the new entity as CEO. Thomas Kurian, CEO of Google Cloud, described the TPUs as "specifically optimized for efficiency and performance in the AI era."

JOHN

The track record exists: Anthropic, Citadel Securities, and Gemini itself already run production workloads on TPUs. The chip has existed since 2015 — a decade of production use, purpose-built design for AI training and inference, with documented efficiency advantage for agentic workloads.

MARIA

But no pricing was disclosed in the announcement. No SLA, no latency benchmark, no cost per exaflop. And the central architectural constraint hasn't changed: TPUs run best in JAX and XLA. Workloads built in PyTorch and CUDA require non-trivial porting. A new procurement channel isn't a new chip architecture. Wait for the operating numbers before any migration plan. Blackstone also separately bet on Anthropic earlier in May — that's a compute portfolio strategy, not a single-stack choice.

JOHN

AI agents moving from demo to production is the thread running through all product news this week. Anthropic, at the Code with Claude event in London on May nineteenth, launched self-hosted sandboxes in public beta and MCP Tunnels in research preview. Both tackle the same chokepoint: security teams that refuse to approve agents whose execution environment lives outside the corporate perimeter.

MARIA

The architecture separates two problems that came bundled before. The agent loop — orchestration, context management, error recovery — stays in Anthropic's infrastructure. Tool execution migrates to the environment controlled by the customer. Four managed providers at launch: Cloudflare with microVMs and zero-trust secrets injection; Daytona with stateful environments via SSH and pause-and-restore; Modal with sub-second startup and scale to hundreds of thousands of concurrent sandboxes; and Vercel with VM isolation in milliseconds and VPC peering. Organizations can also bring their own sandbox client.

JOHN

MCP Tunnels solve a different surface. Not where the code runs, but what the agent talks to. A lightweight gateway deployed inside the private network opens a single encrypted connection out to Anthropic's routing proxy. No inbound firewall rules. No public endpoints. Internal databases, private APIs, ticketing systems: everything becomes a tool callable by the agent.

MARIA

But there's a distinction teams in regulated verticals need to call out explicitly before taking it to compliance review. Even when all tool execution happens locally, orchestration metadata — session state, context — still flows through Anthropic's systems. "Compute stays in our VPC" and "orchestration never leaves our VPC" are two entirely different compliance approvals. Conflating them is what turns a two-week approval cycle into four months. Anthropic published four reference architectures and three production case studies — that's what goes to the compliance meeting, not the product slide.

JOHN

Three production integrations are already running. Clay with a GTM agent called Sculptor on Managed Agents and Daytona, building and monitoring workflows autonomously. Rogo — AI platform for institutional finance — building an analyst agent on Managed Agents and Vercel Sandbox for proprietary data. And Amplitude with an internal design agent live on Cloudflare. The Amplitude team reached a functional version in two days. Another CTO cited by Anthropic did the initial deployment in under a week using Modal.

MARIA

That said, MCP Tunnels is in research preview with explicit "as-is" language. No SLA. For a controlled pilot with non-regulated data, sure. For a critical production pipeline in a regulated vertical, no. The distinction between preview and GA will be expensive for teams that ignore it.

JOHN

That same week, Cloudflare completed its infrastructure stack for agents with six named layers and a deep rebuild of Browser Run. The previous product ran on infrastructure shared with Cloudflare's Browser Isolation, optimized for long, stable human sessions. AI agents generate completely different patterns: short, bursty, highly concurrent. The migration to dedicated Containers with regional pools of pre-warmed Chromium delivered four times more concurrency — one hundred and twenty concurrent browsers, versus thirty — and fifty percent less latency on quick actions.

MARIA

The migration from Workers KV state to D1 with Queues is the most directly replicable technical pattern this week. Workers KV has eventual consistency — this created race conditions when multiple agents tried to reserve the same resource in parallel. D1 with Queues is transactional, with batch writes supporting up to five hundred thousand containers per region. If you run concurrent agents against any store with eventual consistency and you're seeing race conditions in resource allocation, transactional locking at the data layer solves it more cleanly than application-level locking.
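The pattern described here, letting the data layer decide which agent wins a reservation, can be sketched with a conditional update inside a transaction. This is a minimal illustration using Python's built-in `sqlite3` as a stand-in for any transactional SQL store; it is not Cloudflare's code, and the `resources` table and `reserve` helper are hypothetical names.

```python
import sqlite3


def reserve(conn: sqlite3.Connection, resource_id: str, agent_id: str) -> bool:
    """Try to claim a resource; return True only for the agent that wins."""
    with conn:  # commits on success, rolls back on error
        cur = conn.execute(
            "UPDATE resources SET owner = ? WHERE id = ? AND owner IS NULL",
            (agent_id, resource_id),
        )
    # rowcount is 1 only if this statement changed owner from NULL.
    return cur.rowcount == 1


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE resources (id TEXT PRIMARY KEY, owner TEXT)")
conn.execute("INSERT INTO resources (id, owner) VALUES ('browser-1', NULL)")
conn.commit()

first = reserve(conn, "browser-1", "agent-a")
second = reserve(conn, "browser-1", "agent-b")
owner = conn.execute(
    "SELECT owner FROM resources WHERE id = 'browser-1'"
).fetchone()[0]
print(first, second, owner)
```

The check and the write happen in one statement, so there is no window between "read that it is free" and "write that it is mine" for a second agent to slip into. That window is what an eventually consistent read-then-write sequence leaves open.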

JOHN

The complete stack has six layers. Compute, with Workers V8 for lightweight tasks and full Linux Sandboxes in GA for agents that need git, bash, and a dev server. Orchestration, the Dynamic Workflows — roughly three hundred lines MIT-licensed, each step independently retryable, each sleep hibernating without accumulated cost for idle tenants. Memory in private beta, with parallel search across five channels and Reciprocal Rank Fusion to merge results, with memory profiles shared across agent teams. Browser Run rebuilt with WebGL and WebMCP support. And a Commerce layer co-designed with Stripe, where agents can create Cloudflare accounts, register domains, start subscriptions, and go to production autonomously, with a default cap of one hundred dollars per month per provider.
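Reciprocal Rank Fusion, mentioned for the Memory layer, is a simple way to merge ranked lists from several search channels. The sketch below is the generic technique, not Cloudflare's implementation: each item scores the sum of 1 / (k + rank) across the lists it appears in. The constant k = 60 is the commonly used default, and the channel contents are made up for illustration.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge ranked lists: each item scores sum(1 / (k + rank)), rank from 1."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, item in enumerate(ranking, start=1):
            scores[item] += 1.0 / (k + rank)
    # Sort by score descending; break ties by item id for determinism.
    return sorted(scores, key=lambda item: (-scores[item], item))


# Three hypothetical search channels returning memory ids.
channels = [
    ["m1", "m2", "m3"],
    ["m2", "m1", "m4"],
    ["m2", "m5"],
]
fused = reciprocal_rank_fusion(channels)
print(fused)
```

Because only ranks are used, the channels do not need comparable scores, which is why the method suits merging results from heterogeneous retrievers.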

MARIA

That Commerce layer needs specific attention. An agent that can register domains and start subscriptions creates an autonomous spend surface. The one-hundred-dollar cap is an operational guardrail, not a security policy control. At scale, with multiple agents executing autonomously, that can generate unexpected billing events quickly. The risk needs controls at the authorization policy level, not a cap left at its default.
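One way to picture a control at the authorization level is a shared budget that every purchase must pass through, regardless of which agent or provider is involved. The sketch below is purely illustrative: `SpendPolicy` and its fields are hypothetical and do not correspond to any Cloudflare or Stripe API.

```python
from dataclasses import dataclass, field


@dataclass
class SpendPolicy:
    """Hypothetical org-level budget checked before any agent purchase."""
    monthly_limit_usd: float
    spent_usd: float = 0.0
    denied: list[str] = field(default_factory=list)

    def authorize(self, agent_id: str, amount_usd: float) -> bool:
        # Deny when the purchase would push total spend past the shared limit.
        if self.spent_usd + amount_usd > self.monthly_limit_usd:
            self.denied.append(agent_id)
            return False
        self.spent_usd += amount_usd
        return True


policy = SpendPolicy(monthly_limit_usd=250.0)
# Each request is under a $100 per-provider cap, yet the total stays bounded.
results = [policy.authorize(f"agent-{i}", 90.0) for i in range(4)]
print(results, policy.spent_usd, policy.denied)
```

Four agents each staying under a per-provider cap would otherwise spend $360 between them. The shared policy stops at $180 and records which agents were refused.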

JOHN

And the positioning difference versus alternatives deserves to be named. AWS has an Agent Registry, but without a managed browser layer and with no equivalent to Agent Memory. Google Cloud has the GKE Agent Sandbox, but as a Kubernetes primitive, not a managed service. Cloudflare operates these same primitives internally in its own products, what it calls "Customer Zero." That's a relevant operational signal of maturity.

While agent infrastructure matures, there's growing tension on the side of the human developers operating these systems, and it needs to be named. An API security firm tracked a ten-fold increase in monthly security discoveries within Fortune 50 companies between December 2024 and June 2025: from one thousand to over ten thousand vulnerabilities per month. The period coincides with peak adoption of AI coding tools. Code volume is outpacing human review capacity.

MARIA

The problem documented by 404 Media isn't that AI writes bad code. It's that adoption mandates are measuring participation, not quality. An Amazon employee started inflating AI usage numbers to hit adoption metrics — without using the output. A developer told the outlet directly: "The actual quality of the output doesn't matter as much as our willingness to participate." Google reports seventy-five percent of new code generated by AI. Anthropic, ninety percent. Microsoft's CTO projects ninety-five percent by 2030. These percentages measure mandate strength — not delivery speed, not defect rates, not engineering health.

JOHN

GitClear analyzed two hundred and eleven million lines of code. AI coding tools raised copy-paste code from eight point three to twelve point three percent of all changes. Block duplication grew eight-fold. Refactoring activity dropped from twenty-five to under ten percent. Refactoring is a predictive indicator of long-term maintenance cost. When it falls to under half its previous share, the code delivered today creates multiplied work for whoever comes after.

MARIA

And there's a randomized controlled trial with fifty-two engineers that's hard to ignore. Those who used AI assistance completed the task in similar time to the control group, but scored seventeen percentage points lower on the subsequent comprehension quiz: fifty versus sixty-seven percent. Debug tasks showed the steepest drop. Speed stayed flat. Comprehension fell. And the standard metrics don't capture this: DORA metrics solid, PR count up, code coverage green. The dashboard looks healthy while knowledge on the team deteriorates.

JOHN

One engineer told 404 Media: "It's definitely making me dumber."

MARIA

Another: "We're building a rat's nest of technical debt that will be impossible to untangle when these models become prohibitively expensive."

JOHN

And the deliverable output from this round of adoption, what we can measure so far, is headcount cuts, not quality improvement. Meta cut eight thousand people, ten percent of the workforce, citing AI gains. Microsoft offered voluntary retirement to roughly one hundred and twenty-five thousand people, seven percent of the US workforce. Snapchat cut sixteen percent of the team. Headcount is the deliverable. Not defect rate, not system reliability, not launch velocity.

MARIA

And there's one specific case that shows how seemingly small product decisions propagate invisibly through production for weeks. Anthropic published on April twenty-third a postmortem tracing six weeks of Claude Code degradation complaints to three overlapping product changes — none of them a model regression. The model weights stayed stable. Everything was product layer.

JOHN

Three changes, three distinct timelines, three different user cohorts — creating the appearance of broad, inconsistent degradation. First: on March fourth, Claude Code's default reasoning effort was downgraded from high to medium to prevent the UI from appearing frozen. It stayed that way for thirty-three days. Second: on March twenty-sixth, a cache bug made old reasoning pruning fire on every turn rollover, not once after inactivity. A user with nine hundred thousand tokens in context going idle for an hour faced a complete cache miss on the next message — and every subsequent message also became a cache miss, explaining accelerated rate limit drain. Fixed on April tenth. Third: on April sixteenth, a verbosity cap in the Opus 4.7 system prompt limited text between tool calls to twenty-five words and final responses to one hundred words. Internal tests showed no regression. Later investigation found a three percent drop in coding evaluations. Reverted on April twentieth.

MARIA

Stella Laurenzo, director of AMD's AI group, analyzed six thousand eight hundred and fifty-two Claude Code session files — seventeen thousand eight hundred and seventy-one thinking blocks, two hundred and thirty-four thousand seven hundred and sixty tool calls. She found reads-per-edit collapsed from six point six to two point zero. From research-first to edit-first. Her team described this as rendering the tool inadequate for complex engineering work.
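A reads-per-edit ratio of this kind is straightforward to compute from a session's tool-call log. The sketch below is an assumption about how such a metric could be defined, not the method used in the analysis quoted above. The tool names and their grouping into read-type and edit-type calls are illustrative.

```python
from collections import Counter
from typing import Optional

# Assumed tool names; real session logs may label these differently.
READ_TOOLS = {"Read", "Grep", "Glob"}
EDIT_TOOLS = {"Edit", "Write"}


def reads_per_edit(tool_calls: list[str]) -> Optional[float]:
    """Ratio of read-type to edit-type tool calls; None if there are no edits."""
    counts = Counter(tool_calls)
    reads = sum(counts[name] for name in READ_TOOLS)
    edits = sum(counts[name] for name in EDIT_TOOLS)
    return reads / edits if edits else None


research_first = ["Read", "Grep", "Read", "Read", "Glob", "Read", "Edit"]
edit_first = ["Read", "Edit", "Read", "Edit", "Bash", "Write", "Edit"]
print(reads_per_edit(research_first))
print(reads_per_edit(edit_first))
```

Tracked per session and charted over time, a ratio like this would show a shift from research-first to edit-first behavior, which task pass rates alone can miss.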

JOHN

And the distinction the postmortem didn't make clear enough: two of the three changes were deliberate product tradeoffs, not bugs. The reasoning effort downgrade and the verbosity cap were conscious decisions. Only the cache behavior was an unintended regression. Treating them uniformly as "quality degradation" generated the legitimate criticism that appeared on Hacker News.

MARIA

One commenter put it this way: "Changing the system prompt under users when you published benchmarks using an older system prompt seems deceptive."

JOHN

Operators running Claude Code in automated pipelines need to treat system prompt changes and reasoning effort defaults as deployment variables, not as constants. Instrument reasoning depth per session and reads-per-edit before the next rollout, not after.

The last part of this edition is three papers published this week that together deliver something rarer than a benchmark: honest measurement instruments for what's already in production. The first measures factual accuracy by language and region. A fourteen-day study, February ninth through twenty-second, 2026, conducted by Mirac Suzgun and Emily Shen evaluated six production chatbots on two thousand one hundred factual questions derived from same-day BBC News stories, covering six regional services: US and Canada, Arabic, Africa, Hindi, Russian, and Turkish.

MARIA

The six models: Gemini 3 Flash, Gemini 3 Pro, Grok 4, Claude 4.5 Sonnet, GPT-5, and GPT-4o mini. On multiple choice, the best systems reach over ninety percent accuracy on events reported hours earlier. The same system in open-ended response format loses eleven to thirteen percentage points. The cohort average drops sixteen to seventeen points. That's the format real users use — not multiple choice.

JOHN

The regional gap is the most operationally relevant data in the study. All models hit their lowest accuracy in Hindi: seventy-nine percent, versus eighty-nine to ninety-one percent in other languages. A ten to twelve point difference. Citation analysis reveals the mechanism: models answering questions in Hindi cite English Wikipedia more than any Hindi news outlet. Over seventy percent of errors trace to retrieving the wrong document, not reasoning failure. When models retrieved the correct source, they extracted answers with high accuracy.

MARIA

And queries with false premises expose the deepest vulnerability. Models scoring eighty-eight to ninety-six percent on clean questions tanked to nineteen to seventy percent when questions embedded subtle factual errors. One model accepted fabricated premises sixty-four percent of the time. And there's a detection paradox: the model with the best ability to detect false premise ranked second in adversarial robustness. Detection and recovery are partially independent — you can't infer adversarial robustness from detection ability.

JOHN

The practical conclusion: high accuracy in English doesn't transfer to other languages. A model showing ninety percent on English evaluation might be running at seventy-nine percent or less in the languages your users actually speak. Citation logs at call level are the only way to diagnose retrieval bias before it becomes a production issue. English-only evaluation hides a regional gap of ten to twelve points.
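Call-level citation logging makes this kind of diagnosis a small aggregation job. The sketch below assumes a hypothetical log schema (`lang`, `correct`, `cited_lang`) and uses made-up records, not data from the study. It reports accuracy per query language alongside the share of answers whose cited source is in the same language as the question.

```python
from collections import defaultdict

# Hypothetical call-level log records: the field names are assumptions.
calls = [
    {"lang": "hi", "correct": False, "cited_lang": "en"},
    {"lang": "hi", "correct": True, "cited_lang": "hi"},
    {"lang": "hi", "correct": False, "cited_lang": "en"},
    {"lang": "en", "correct": True, "cited_lang": "en"},
    {"lang": "en", "correct": True, "cited_lang": "en"},
]


def per_language_report(records: list[dict]) -> dict[str, dict[str, float]]:
    """Accuracy and share of same-language citations, per query language."""
    totals = defaultdict(lambda: {"n": 0, "correct": 0, "same_lang": 0})
    for r in records:
        t = totals[r["lang"]]
        t["n"] += 1
        t["correct"] += r["correct"]
        t["same_lang"] += r["cited_lang"] == r["lang"]
    return {
        lang: {
            "accuracy": t["correct"] / t["n"],
            "same_lang_citations": t["same_lang"] / t["n"],
        }
        for lang, t in totals.items()
    }


report = per_language_report(calls)
for lang, stats in report.items():
    print(lang, stats)
```

A language where both numbers drop together points at retrieval rather than reasoning, which is the pattern the study describes for Hindi.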

MARIA

The second paper this week is more theoretical — but it has a practical diagnostic tool teams doing fine-tuning can use immediately. Vishal Rajput published to arXiv in May 2026 a paper unifying seven robustness families — adversarial, domain adaptation, photometric and occlusion invariance, compositional generalization, temporal robustness, alignment safety, and classical anisotropic regularization — under a single statistical principle. The matching principle: estimate the covariance of deployment perturbations that preserve the label, then regularize the encoder's Jacobian so its range covers that covariance.

JOHN

CORAL, IRM, Jacobian penalties, metric learning, RLHF-style constraints — all recast as different estimators of the same object. Not independent robustness tricks — variants of the same statistical problem. The work validates the principle in thirteen pre-registered experiments, from classical ML benchmarks to a seven billion parameter LLM. Twelve of thirteen passed in the predicted ranking: matched regularizer beat isotropic, which beat mismatched. The one exception was Office-31, an eigengap failure named before the experiment ran.

MARIA

The result that matters most for fine-tuning teams: in the Qwen2.5-7B experiment, the matched style-PMH regularizer improved selective honesty and preserved Style TDI where standard DPO degraded Style TDI. Alignment methods can degrade deployment robustness in ways that task accuracy evaluation doesn't detect. The work introduces the Trajectory Deviation Index — a label-free probe of embedding sensitivity for deployment monitoring, when task accuracy and Frobenius norm of the Jacobian aren't sufficient. Add it to your evaluation harness before the next fine-tune.

JOHN

And closing at the research frontier: NVIDIA published Gated DeltaNet-2, from Ali Hatamizadeh, Yejin Choi, and Jan Kautz. The work attacks a structural limitation in all prior delta-rule linear attention models. The problem: Gated DeltaNet and KDA use a single scalar gate to govern two distinct memory operations, erasing stale content on the key axis and committing new content on the value axis. Forcing a single write decision on two separate concerns causes interference that scrambles existing associations when the system should be selectively revising.

MARIA

The solution: two independent gates per channel. An erase gate b_t on the key axis. A write gate w_t on the value axis. The model recovers KDA when both collapse to the same scalar — it's a strict generalization, not architecture replacement. Implemented in Triton with a chunkwise WY algorithm and gate-aware backward pass, running on a single H100, with throughput nearly flat relative to sequence length.
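The idea of separating the two gates can be illustrated on a toy associative memory. To be clear about what this is: the transcript does not give the paper's recurrence, so the update below is an assumed, simplified form (no decay term, no chunkwise computation) built on the classic delta rule, with names chosen for illustration. It shows only the structural point that a per-key-channel erase gate and a per-value-channel write gate reduce to the single-gate rule when both collapse to the same scalar.

```python
Matrix = list[list[float]]
Vector = list[float]


def gated_delta_step(S: Matrix, k: Vector, v: Vector,
                     erase: Vector, write: Vector) -> Matrix:
    """One toy delta-rule update with separate erase and write gates.

    S maps keys to values (rows: value channels, columns: key channels).
    erase has one entry per key channel, write has one per value channel.
    """
    # What the memory currently returns for this key: S @ k.
    predicted = [sum(s_ij * k_j for s_ij, k_j in zip(row, k)) for row in S]
    return [
        [
            S[i][j] - predicted[i] * erase[j] * k[j] + write[i] * v[i] * k[j]
            for j in range(len(k))
        ]
        for i in range(len(v))
    ]


def scalar_delta_step(S: Matrix, k: Vector, v: Vector, beta: float) -> Matrix:
    """Classic delta rule with one scalar gate: S - beta * (S k - v) k^T."""
    predicted = [sum(s_ij * k_j for s_ij, k_j in zip(row, k)) for row in S]
    return [
        [S[i][j] - beta * (predicted[i] - v[i]) * k[j] for j in range(len(k))]
        for i in range(len(v))
    ]


S = [[1.0, 0.0], [0.0, 1.0]]
k = [1.0, 0.0]
v = [0.5, 2.0]

# With both gates collapsed to one scalar, the two rules agree.
beta = 0.5
tied = gated_delta_step(S, k, v, erase=[beta, beta], write=[beta, beta])
classic = scalar_delta_step(S, k, v, beta)

# Gates can now differ: write the new value without erasing anything.
write_only = gated_delta_step(S, k, v, erase=[0.0, 0.0], write=[1.0, 1.0])
print(tied, classic, write_only)
```

With a single scalar there is no way to express the last case, full write with zero erase. That coupling is what the segment describes as interference between the two memory operations.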

JOHN

The numbers for long-context inference: on the RULER multi-key needle-in-haystack benchmark at four thousand tokens, Gated DeltaNet-2 recurrent scores thirty-seven point eight, versus twenty-eight for KDA and twenty-seven point eight for Gated DeltaNet — a thirty-five percent jump. On S-NIAH-3 at two thousand tokens: from sixty-three point two for KDA to eighty-nine point eight. Overall average accuracy on commonsense and language modeling sets: fifty-three point eleven, versus fifty-two point twenty-eight for KDA. The erase gate accounts for most of the gain — selective protection on the key side prevents existing associations from being overwritten during unrelated writes.

MARIA

Two critical limitations for adoption. The license is NVIDIA Source Code License-NC, which is non-commercial. Teams building commercial inference products need to negotiate a separate license; it's not open source. And training was done on four-thousand-token sequences. RULER scores on longer contexts are extrapolation, not validation. Before committing to the architecture, validate on your own operational context lengths.

JOHN

The principle that sticks, regardless of license: decouple erase and write operations per channel. Sharing a single scalar gate between the two is measurable precision loss, and Gated DeltaNet-2 quantifies exactly what that loss costs in long-context retrieval.

This week, AI cost changed address, from GPU to memory, and agents moved from demos to real enterprise systems. The benchmarks arrived: to measure regional accuracy, alignment robustness, and what a poorly designed gate costs in long contexts. Wire on Monday. Edition next Friday. Good work.