The Pricing Power Game — and the Open-Weights Response
Frontier labs are testing how much enterprises will pay for opaque flagship access, while open weights and new RL science quietly reset what the rest of the stack can do.
Transcript
This week, OpenAI launched the most capable model in its history — and refused to hand over an API key. Meanwhile, an open-source model from Alibaba at 55 gigabytes outperformed the 807-gigabyte predecessor it replaces on top-tier coding benchmarks. Anthropic quietly attempted to quintuple the price of its primary developer product — and reversed course within hours. Researchers from Mila closed the debate on whether reinforcement learning actually teaches new capabilities to models, and the answer is yes. The AI stack is being repriced in both directions simultaneously. This week in the Edition: the pricing power game at frontier labs, the response from open-weights models and RL science, and three concrete warnings for teams already running agents in production — sabotage audits, modality gaps in VLMs, and consent in medical scribing. John, Maria, here's what happened.
First block: pricing power at the frontier. On April 23, OpenAI launched GPT-5.5. Available in Codex and rolling out to paid ChatGPT subscribers — but with no API access. The official note: "API deployments require different safeguards and we are working with partners on the security requirements to serve it at scale. We will bring GPT-5.5 to the API very soon." Very soon. No date, no SLA, no versioning. John, you followed this launch closely — what's happening here?
The launch has two moves that need to be read together. The first is the model — and the results are genuinely strong, but we'll get there. The second is the access structure, and that's where the strategy becomes explicit. When the API arrives, GPT-5.5 will cost $5 per million input tokens and $30 per million output tokens. GPT-5.4 costs $2.50 and $15. Doubled. GPT-5.5 Pro goes further: $30 input and $180 output. That Pro tier is positioned as the top of OpenAI's emerging tier logic — the flagship for high-value use cases. GPT-5.4 remains available at current rates for those who need predictability.
And with the API still unavailable, what do developers have right now?
They have a semi-official endpoint — and the context with Anthropic makes that semi-official status politically interesting. The endpoint /backend-api/codex/responses, the same one used by the open-source Codex CLI, was publicly endorsed for third-party integrations by Romain Huet, OpenAI's head of developer relations. Huet wrote in March: "we want people to be able to use Codex and the ChatGPT subscription wherever they want — in the app, in the terminal, in JetBrains, Xcode, OpenCode, Pi, and now in Claude Code." Peter Steinberger, creator of OpenClaw and now an OpenAI employee, confirmed: "the OpenAI sub is officially supported." Any developer with a ChatGPT or Codex subscription can route prompts to GPT-5.5 today via that endpoint. The problem: no SLA, no published rate limits, no versioning commitment. It's open-source infrastructure that OpenAI chose not to block — not a supported product. For production, it's a sandbox.
The real cost in heavy use has another dimension that doesn't show up in the per-token price.
Yes, and it matters for the budget. Simon Willison measured that the xhigh reasoning level consumed 9,322 reasoning tokens on a single SVG generation task — versus 39 tokens at the default level. A 239x difference. At $30 per million output tokens, sustained workloads with intensive reasoning will show up quickly on enterprise spending dashboards. The cost structure is not linear with task complexity.
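Willison's measurement translates into dollars with simple arithmetic. A back-of-the-envelope sketch in Python, using the announced GPT-5.5 output price; the 10,000-call workload at the end is an invented illustration, not a figure from the discussion:

```python
# Back-of-the-envelope cost of the reasoning-token blowup described above.
# Token counts are Simon Willison's measurement; the price is the announced
# GPT-5.5 output rate. The daily workload size is illustrative only.

PRICE_PER_M_OUTPUT = 30.00  # USD per million output tokens (GPT-5.5)

def output_cost(tokens: int, price_per_million: float = PRICE_PER_M_OUTPUT) -> float:
    """Cost in USD for a given number of billed output tokens."""
    return tokens / 1_000_000 * price_per_million

default_tokens = 39     # default reasoning level, single SVG task
xhigh_tokens = 9_322    # xhigh reasoning level, same task

ratio = xhigh_tokens / default_tokens
print(f"ratio: {ratio:.0f}x")                          # ~239x
print(f"default: ${output_cost(default_tokens):.6f}")  # fractions of a cent
print(f"xhigh:   ${output_cost(xhigh_tokens):.6f}")    # ~$0.28 per call

# Scale to a sustained workload: 10,000 such calls per day, all at xhigh.
daily = 10_000 * output_cost(xhigh_tokens)
print(f"10k xhigh calls/day: ${daily:,.2f}")
```

The point the sketch makes: the per-call cost looks trivial, but the 239x multiplier compounds at workload scale, which is why the reasoning level, not the per-token price, dominates the budget line.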
Does the model deliver to justify that structure?
Ethan Mollick's benchmarks suggest yes, in the contexts he tested. Mollick, a professor at Wharton, had verified early access and published two concrete experiments. The first is a coding benchmark. He gave the same instruction to all relevant models — from OpenAI's o3, launched about a year ago, to Kimi K2.6, the current best open-weights model. The prompt: build a procedurally generated 3D simulation showing the evolution of a port city from 3000 BC to 3000 AD, with user controls and a visually rich appearance. Result: only GPT-5.5 Pro modeled a city that actually evolves. Competitors generated new buildings replacing previous ones — not emergent urban evolution. GPT-5.5 Pro completed the challenge in 20 minutes. GPT-5.4 Pro took 33 minutes. A 39% drop.
And the second experiment?
The second reconfigures what research productivity means. Mollick used Codex — OpenAI's desktop app powered by GPT-5.5 — and loaded a folder with a decade of raw research files on crowdfunding. Stata, CSV, XLS, and Word files. Data he had accumulated over years and never published. With four prompts — four total interactions, without touching the text — the model sorted the files, generated a new hypothesis, applied sophisticated statistical methods, wrote a literature review with verifiable citations, and produced a complete academic paper. Mollick's assessment: "I would be very happy if this paper were the output of a second-year PhD project." The citations were real. The statistics were real. Four prompts.
But he flagged concrete caveats.
Two important ones. First: the model is uneven. The paper arrived with a hypothesis that Mollick — a crowdfunding expert — found uninteresting, and the model failed to address standard causality concerns despite using sophisticated statistical methods. Judgment about what's worth investigating still doesn't match that of a senior researcher. Second: the 3D simulation benchmark compared GPT-5.5 Pro with models that simply failed to model evolution — that's not a high bar for enterprise. For use cases requiring precision rather than spectacle, the data needs independent replication. And there's the stack question: the paper test required the Codex harness orchestrating GPT-5.5 Pro. Teams that treat the model, app, and harness layers as independent will underestimate both the compound gains — and the compound costs.
Now the second piece of this block — Anthropic. Because if OpenAI is testing how to structure premium access, Anthropic was more direct: it tested the price elasticity of its own users.
And got caught. On April 22, with no announcement — no blog post, no changelog, no email to subscribers — Anthropic updated the claude.com/pricing page, removing Claude Code from the $20-per-month Pro plan column. The product now appeared exclusively in the $100 and $200 Max plans. Five times more expensive. The internet noticed within minutes. Screenshots circulated on Reddit, Hacker News, and Twitter. The Internet Archive captured the page before the reversal. Anthropic undid the change within hours.
The official response came via tweet — no formal statement.
Only by tweet. Amol Avasare, Anthropic's Head of Growth, described it as "a small test on approximately 2% of new prosumer signups" and said existing Pro and Max subscribers were not affected. Simon Willison — who has 105 published posts teaching Claude Code and ran a tutorial at NICAR, the largest data journalism event in the US — challenged the framing directly: "I don't accept the '2% of new signups' framing, because everyone I talked to was seeing the new pricing grid and the Internet Archive already had a copy." And there's a data point Anthropic didn't explain: Claude Cowork — described by Willison as "a rebranded version of Claude Code with a less threatening hat" — remained available on the $20 plan throughout the entire episode. No justification for the inconsistency. As of publication, no formal statement had been issued.
And OpenAI exploited the opening immediately.
Thibault Sottiaux, OpenAI Codex engineering lead, posted: "I don't know what they're doing over there, but Codex will continue to be available on the FREE and PLUS $20 plans. We have the compute and efficient models to do that. For important changes, we will engage the community well before making them. Transparency and trust are two principles we will not break, even if it means earning less momentarily." It's a direct pitch to developers choosing where to build workflows. Willison said the episode shook his confidence in Anthropic's pricing transparency and that he is actively reconsidering Codex as a default teaching tool.
How do you read these two moves together? OpenAI and Anthropic testing the price ceiling in the same week?
Coordinated or coincidental, the effect is the same. Frontier labs are learning where the enterprise market's willingness-to-pay ceiling sits. OpenAI is building a tier logic communicated in advance, even if the API isn't available yet. Anthropic tried to find out whether Claude Code has similar elasticity and retreated without explanation. For enterprise procurement, the lesson is this: any pricing commitment made today with either of these vendors carries implicit optionality for the lab. Willison's question remains open: "strategically, should I bet on Claude Code if Anthropic can quintuple the minimum price of the product?" Any pipeline built on these products today carries pricing risk that no current contract fully covers.
Second block: the open-weights model response. While closed labs test how far they can push prices, Alibaba published a result that changes the infrastructure cost calculation for any team running coding agents. Maria, the Qwen3.6-27B.
The number that defines this launch is 14.5. The Qwen3.5-397B-A17B — the previous open-source coding flagship — takes up 807 gigabytes on Hugging Face. The new Qwen3.6-27B takes up 55.6 gigabytes. Fourteen point five times smaller. And on SWE-bench Verified — the standard for coding agents — the 27-billion-parameter model scores 77.2%. The 397-billion predecessor scores 76.2%. The smaller model outperforms the larger one. With a Q4_K_M quantized version that fits in 16.8 gigabytes — enough for a single consumer GPU.
What explains these gains in such aggressive compression?
A new hybrid architecture called Gated DeltaNet. The model has 64 layers organized in a fixed pattern: every four layers, three use Gated DeltaNet linear attention followed by FFN, and one uses standard Gated Attention with grouped-query followed by FFN. The Gated DeltaNet layers have 48 attention heads for values and 16 for queries and keys. The Gated Attention uses 24 query heads and 4 key-value heads via grouped-query attention. The ratio of linear attention to standard attention reduces memory pressure in long contexts; the standard attention at fixed intervals anchors precise retrieval. Native context: 262,144 tokens, extensible to 1,010,000. And there's a feature called Thinking Preservation: the reasoning chain is maintained between conversation turns, eliminating redundant recomputes in agent workflows that track state across long sessions.
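The layer layout just described can be made concrete with a toy sketch. This assumes the three DeltaNet layers precede the standard-attention layer within each block of four — the discussion specifies only the 3:1 ratio, the head counts, and the 64-layer total, so the ordering is an assumption:

```python
# Toy rendering of the hybrid pattern described above: in each block of
# four layers, three use Gated DeltaNet linear attention + FFN and one
# uses standard gated attention with grouped queries + FFN. The ordering
# inside each block is an assumption; only the 3:1 ratio is specified.

NUM_LAYERS = 64
BLOCK = ("deltanet", "deltanet", "deltanet", "full_attention")

layers = [BLOCK[i % len(BLOCK)] for i in range(NUM_LAYERS)]

# Head configuration per layer type, as stated in the discussion.
heads = {
    "deltanet": {"value_heads": 48, "qk_heads": 16},
    "full_attention": {"query_heads": 24, "kv_heads": 4},  # 6:1 GQA ratio
}

print(layers.count("deltanet"))        # 48 linear-attention layers
print(layers.count("full_attention"))  # 16 standard-attention layers
```

The 48 linear-attention layers are what relieve memory pressure over long contexts; the 16 full-attention layers, spaced at fixed intervals, are what preserve precise retrieval.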
Do the benchmarks hold up beyond SWE-bench?
Yes, with one small regression worth noting. On SWE-bench Pro, Qwen3.6 scores 53.5% versus 50.9% for the 397-billion predecessor. On Terminal-Bench 2.0, 59.3% versus 52.5%. On LiveCodeBench v6, 83.9% versus 83.6% — essentially a tie. The regression is on GPQA Diamond — PhD-level reasoning — where it scores 87.8%, fractionally below the 397B's 88.4%, though still above Gemma 4 31B at 84.3%. And the widest margin runs the other way, on SkillsBench: 48.2% versus 30.0% for the predecessor. Eighteen percentage points. That suggests the efficiency gains did not come at the cost of specialization.
And what does that change in practice for infrastructure teams?
The deployment calculation changes materially. The Qwen3.5-397B-A17B required multi-node infrastructure or dedicated server hardware. The Qwen3.6-27B runs on a single A100-80GB or two A40s. In the quantized version, Simon Willison measured 25.57 tokens per second with llama.cpp on local hardware — sufficient for individual developer pipelines or low-concurrency agents without cloud dependency. For high-concurrency in production, the model card recommends SGLang, KTransformers, or vLLM. And the licensing is Apache 2.0: no usage restrictions, no legal friction for internal deployment or derivative fine-tuning. Teams that were sizing multi-node infrastructure for coding agents have a concrete reason to run Qwen3.6-27B first.
Any caveats?
Two honest ones. The benchmark suite is predominantly from Qwen itself, including internal evaluations like QwenWebBench and QwenClawBench. Independent replication on SWE-bench Verified had not appeared at time of publication. And the computational overhead of Thinking Preservation in long multi-turn sessions is not quantified in the model card. Those are the right experiments to run before committing this model to critical production.
Now the piece that explains why open-source keeps closing the gap faster than closed labs expected — and it's pure training science.
Mila published a paper that closes a central debate in LLM research: reinforcement learning with task-based rewards actually teaches new capabilities — it does not merely concentrate probability on outputs the model already favored. The debate centered on the "distribution sharpening" hypothesis: the idea that RL works because it makes the model more confident in its existing preferences — reducing uncertainty, concentrating probability mass on already-plausible outputs — not because it expands the capability space. If true, the implication was that inference-time scaling without training could achieve the same gains: best-of-N sampling, temperature sweeps, speculative decoding. That would make RL pipelines an expensive, redundant form of training.
And the researchers refuted that.
Quite definitively — theoretically and empirically. Sarthak Mittal, Leo Gagnon, and Guillaume Lajoie, from Mila and the Université de Montréal, used the standard RL objective with KL regularization and varied four regimes within the same framework: pure task-reward, pure distribution sharpening, and two hybrids called Tilted Sampling and Tempered Sampling. Because all four share the same training procedure, differences in results reflect the signal being optimized — not artifacts of the framework. The verdict: pure sharpening is theoretically unstable. The optima are unfavorable. Empirically, sharpening produces marginal gains while task rewards produce robust improvements and stable learning. Tested on Llama-3.2-3B-Instruct, Qwen2.5-3B-Instruct, and Qwen3-4B-Instruct on mathematical reasoning datasets.
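For reference, the generic KL-regularized objective this kind of setup varies can be written as follows. This is the standard textbook form, not a reproduction of the paper's exact parameterization, and the sharpening reward shown is one common formalization of the hypothesis — the Tilted and Tempered Sampling hybrids interpolate between the two signals:

```latex
% KL-regularized RL objective over completions y for a prompt x.
% r(x, y) is the task reward, \pi_{\mathrm{ref}} the pre-RL reference
% policy, and \beta > 0 the regularization strength.
\max_{\theta} \;
  \mathbb{E}_{x \sim \mathcal{D},\; y \sim \pi_\theta(\cdot \mid x)}
  \bigl[ r(x, y) \bigr]
  \;-\; \beta \, \mathrm{KL}\bigl( \pi_\theta(\cdot \mid x) \,\big\|\, \pi_{\mathrm{ref}}(\cdot \mid x) \bigr)

% Pure distribution sharpening replaces the task reward with the model's
% own reference log-likelihood, so optimization can only concentrate
% probability mass the model already assigns:
r_{\mathrm{sharpen}}(x, y) \;=\; \log \pi_{\mathrm{ref}}(y \mid x)
```

Under the sharpening reward the optimizer has nothing to learn from the task itself, which is exactly why comparing it to the task-reward run isolates where capability gains come from.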
What is the practical implication for those investing in post-training?
The direct implication is that reward function design is not a secondary detail that can be deferred or approximated. If gains came primarily from sharpening, teams could replace RL pipelines with cheaper inference-time scaling. That shortcut is closed. The quality of the reward signal is a first-order input to capability gain. And the paper provides a controlled diagnostic: running a pure sharpening baseline against a task-reward run on the same architecture and dataset produces a clean separation of the two effects. Improvements previously attributed to "better output formatting" or "greater confidence in existing reasoning" may need to be revisited.
And that connects directly to Qwen.
Directly. The post-training recipe is no longer a frontier secret. Any lab with sufficient resources can reproduce RL gains with well-designed task rewards — and is reproducing them. The gap between open and closed is closing because efficient training has become engineering, not magic. What remains differentiated at closed labs are the harnesses and access to proprietary data. Resources are becoming more efficient — Qwen3.6 proves that with 14.5 times less storage. Data is still a real advantage. For now.
Third block: deployment reality. The first two blocks showed where model capability is arriving — in closed labs and open-source. This block is about what happens when you put those models into production. Three articles this week expose concrete gaps that no capability benchmark covers. John, start with Redwood Research's ASMR-Bench.
ASMR-Bench is the benchmark every team running coding agents or autonomous research in production needs to know — and its central number shows that automated sabotage auditing is still operationally inadequate. Redwood Research published a set of 9 real ML research codebases, each paired with sabotaged variants. The sabotages are surgical: they touch hyperparameters, training data, or evaluation code, producing qualitatively different experimental results — while preserving the high-level methodology described in the accompanying paper. Exactly the type of modification that escapes a human reviewer scanning the text for conceptual errors. The LLM auditor's task: detect that the code was sabotaged and identify the specific change.
And the result?
The best result came from Gemini 3.1 Pro, with an AUROC of 0.77 and a top-1 identification rate of 42%. Forty-two percent. The best available LLM auditor today correctly detects the sabotage in fewer than half of cases. And the threat model here is direct for those running autonomous agents: a misaligned or compromised agent conducting research can alter a training loop, distort an evaluation metric, or repartition a dataset — and the resulting paper will still look correct to a human co-author reading the prose. The benchmark also tested LLMs as red teamers — sabotage generators. LLM-generated sabotages were weaker than human ones. But they could still evade auditors of comparable capability. Sub-human sabotage already surpasses current automated review.
What's the number you'd bring to an AI governance meeting?
42% top-1 rate with an AUROC of 0.77 for the best available auditor. And the corollary: LLM-generated sabotage below human level already evades automated review. That establishes the trust floor for any autonomous research pipeline today. That number needs to appear in any AI governance framework covering coding agents or autonomous research.
What does that change for audit frameworks?
An important reframe. Most current frameworks focus on bias, fairness, and output monitoring. ASMR-Bench introduces the integrity of the research process itself as a distinct risk surface — the agent running the experiment is also the agent that can corrupt it. That dual role is not addressed by guardrails at the output layer. An AUROC of 0.77 beats chance but is operationally inadequate for a system where a corrupted experiment can propagate through the entire model development cycle. Teams running automated hyperparameter search, LLM-assisted ablation studies, or AI-generated evaluation harnesses now have a concrete adversarial set to validate their audit controls against.
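The AUROC figure in this block is worth making concrete. A toy computation of what an AUROC measures in an audit setting — the suspicion scores below are invented for illustration, not ASMR-Bench data:

```python
# Minimal AUROC computation for an LLM auditor in a setup like the one
# described above: the auditor assigns each codebase a suspicion score,
# and AUROC is the probability that a sabotaged codebase outranks a
# clean one. The scores here are invented toy data.

def auroc(pos_scores, neg_scores):
    """Probability a random positive (sabotaged) outscores a random
    negative (clean); ties count half. 0.5 = chance, 1.0 = perfect."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))

sabotaged = [0.9, 0.8, 0.4, 0.7]  # auditor suspicion on sabotaged variants
clean     = [0.3, 0.6, 0.2, 0.8]  # auditor suspicion on clean codebases

print(f"AUROC: {auroc(sabotaged, clean):.3f}")
```

An AUROC around 0.77 means that roughly one sabotaged-vs-clean pair in four is ranked the wrong way — far better than chance, far short of what a production trust boundary requires.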
Maria, CrossMath.
CrossMath exposes a different problem — but equally concrete for any team making VLM procurement decisions. Researchers from Nanyang Technological University and Alibaba's Tongyi Lab built a controlled benchmark where each problem is rendered in three formats with equivalent information: text-only, image-only, and image-plus-text. Equivalence verified by human annotators. That's what previous benchmarks consistently failed to guarantee. The focus: intrinsically visual problems — inference of values in mathematical structures requiring multi-step spatial and geometric reasoning, where the correct path necessarily runs through vision.
What did the top models do?
They failed structurally. Adding visual inputs — moving from text-only to image-plus-text — frequently degrades the performance of top VLMs relative to the text-only baseline. What that means: the vision encoder and cross-modal projector — the components that should deliver visual understanding — are net liabilities in rigorous visual reasoning tasks. The models are conducting inference primarily in the textual space. The visual pathway contributes noise, not signal. The authors call this the "modality gap."
Does the benchmark control for visual confounders?
Yes, four image styles: original high-resolution, borderless, beige background, and alternate fonts and colors. This detects models that are capturing image artifacts — borders, fonts, background contrast — instead of the underlying mathematical content. A model that degrades significantly between styles is not reasoning about visual structure. It's doing pattern matching on rendering choices.
What's the practical procurement implication?
Direct. Any deployment using a VLM for document analysis, engineering diagram review, or visual QA on structured data is likely receiving capability claims inflated by text backbone performance. The model may appear to understand diagrams under benchmark conditions while failing silently when the textual context is removed or ambiguous. The practical stress test now exists: the CrossMath dataset is available on Hugging Face at xuyige/CrossMath with evaluation code published on GitHub. The protocol is simple: measure the delta between text-only and image-plus-text performance on that benchmark and treat that gap as the real visual capability ceiling. If the vendor can't close that gap, the visual reasoning capability on the datasheet is not what will show up in production.
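The stress-test protocol just described reduces to a one-line delta. A minimal sketch in Python — the function name, the placeholder accuracies, and the flag threshold are all illustrative, not part of the published CrossMath evaluation code:

```python
# Sketch of the procurement stress test described above: measure the delta
# between text-only and image-plus-text accuracy on matched problems, and
# treat a positive gap as evidence the visual pathway is a net liability.
# Accuracies and the 5-point threshold are illustrative placeholders.

def modality_gap(text_only_acc: float, image_plus_text_acc: float) -> float:
    """Positive gap = adding the image *hurt* relative to text alone."""
    return text_only_acc - image_plus_text_acc

# Hypothetical vendor eval on matched CrossMath-style items:
text_only = 0.72
image_plus_text = 0.61

gap = modality_gap(text_only, image_plus_text)
print(f"modality gap: {gap:+.2f}")

if gap > 0.05:
    print("flag: visual reasoning claim likely inflated by text backbone")
```

The design point is that both accuracies must come from information-equivalent renderings of the same problems; without that equivalence, the delta conflates problem difficulty with the modality effect, which is the failure of earlier benchmarks.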
And does the paper offer a path to improvement?
Yes. Fine-tuning on the CrossMath training set increases performance across all three modalities — text-only, image-only, and image-plus-text — with downstream gains on two general visual reasoning tasks. The modality gap is not an architectural ceiling. It's a training data artifact. VLMs don't learn to reason visually because most training pipelines don't require it. That means the gap is fixable — but needs to be actively addressed in fine-tuning, not assumed to be solved by the base model.
The third warning this week comes from a regulated vertical — and touches directly on HIPAA and patient consent. John, AI medical scribes.
Linguist Emily M. Bender and writer Decca Muldowney published a nine-point argument on April 22 urging patients to refuse consent when clinics ask to record appointments with AI scribing tools. These tools capture audio from patient-physician encounters and produce automatic clinical note drafts. Vendors sell them as a solution to documentation overload — chart-filling consumes hours of unpaid time for many physicians. Bender and Muldowney argue that the consent framing obscures risks that most patients cannot evaluate in the moment.
What are the most critical arguments for enterprise health IT teams?
I'll start with the most immediate: HIPAA compliance is not equivalent to adequate security protocols at the software vendor. The recording goes to a third-party vendor — and even if the audio is deleted quickly, the transcript is sensitive data. That distinction needs to be in vendor contracts. Second: automation bias in omissions. It's reasonably simple to verify what a note says. It's much harder to remember what should be there and isn't. An uncaptured symptom, a dosage nuance, a patient complaint that didn't make it into the transcript simply disappears — without triggering any correction alert. Third: disproportionate impact. Voice recognition accuracy degrades for speakers of non-standard linguistic varieties, non-native speakers, and patients with dysarthria. In the populations most dependent on the healthcare system, physicians spend proportionally more time correcting notes — in a system that promised efficiency gains.
There's a fourth point I find critical from a workforce perspective. The efficiency argument in underfunded healthcare systems doesn't translate into longer appointments — it translates into more patients per physician. And Bender and Muldowney identify a specific side effect on interpreter workflows: physicians accustomed to scribing systems shift their speech register during appointments, adopting a technical "physician-to-physician" style to shape the note — leaving medical interpreters unsure whether or not to translate that section in that moment. It's a signal of how AI tools introduce dysfunction into existing processes that no capability benchmark will capture.
And there's the informed consent question the authors raise precisely. Patients are rarely informed about whether their data will be used to train future model iterations or for "quality assurance." Revoking consent mid-appointment is practically impossible. Genuinely informed consent would consume more appointment time than most visit slots allow. That's the point where the legal argument strengthens for health systems.
The game-theory logic that Bender and Muldowney identify is important for understanding what's coming.
It's the most strategic part of the argument. If patients as a group refuse consent at scale, institutions cannot accumulate the adoption numbers needed to justify the efficiency narrative — which makes it harder to justify increasing workloads per physician. Individual refusal has low cost and is reversible. Collective refusal degrades the business case. This dynamic turns opt-out into a political lever — not merely a personal privacy preference. And it resembles the diffuse pressure that tends to precede formal regulatory intervention.
For health systems with pending scribing contracts, the window is closing.
The questions this article raises — data retention, use for future training, accuracy variance by speaker population, and interpreter workflow disruption — are answerable in vendor negotiations right now. Waiting for a federal framework to resolve them is not a strategy. It's unquantified risk accumulating on the compliance balance sheet.
That's it for this week's Edition. The throughline connecting all three blocks: the AI stack is being repriced at both extremes simultaneously. Closed labs testing the ceiling — GPT-5.5 without an API, Anthropic quietly attempting to quintuple a tier. Open-weights pushing down the floor — 55 gigabytes outperforming 807 on SWE-bench. And Mila's RL science confirming that post-training keeps paying real dividends, which explains why the gap keeps closing. Meanwhile, the deployment gaps we saw in the third block — 42% sabotage detection rate, structural modality gap in VLMs, and opaque consent in medical scribing — haven't been closed by anyone. In Monday's Wire: the gpt-image-2 API numbers — $0.40 per image at 4K — and whether that economics holds up in real-world enterprise creative pipelines. For reading on the site this week: the article on Redwood Research's ASMR-Bench. The link is in the show notes. Until Monday.