FRI, JUN 26, 2026
Issue Nº 66
aiexpert
Research

Frontier models saturate GPQA-Diamond benchmark at 93–94%; SWE-bench Pro becomes key differentiator

The top frontier models (Claude Opus 4.8, Gemini 3.1 Pro, and GPT-5.5) have all converged to 93–94% on GPQA-Diamond, a PhD-level multiple-choice benchmark in biology, chemistry, and physics released in late 2023. The benchmark is now statistically saturated: the 0.7-point difference between first and third place is within the margin of error. In November 2023, GPT-4 scored just 39% on GPQA-Diamond; the jump to 93–94% by mid-2026 demonstrates rapid progress on graduate-level reasoning, but it also means the benchmark no longer meaningfully differentiates frontier models. Anthropic, OpenAI, and Google have all declared GPQA saturated in their system cards.
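To make the margin-of-error claim concrete, here is a rough sketch of the sampling noise on a benchmark of this size. It assumes each score comes from a single pass over GPQA-Diamond's 198 questions, treats the questions as independent, and uses the simple normal approximation for a proportion; the 0.935 accuracy is just the midpoint of the reported 93–94% range, and the function name is illustrative.

```python
import math


def margin_of_error(accuracy: float, n_questions: int, z: float = 1.96) -> float:
    """Half-width of a normal-approximation confidence interval, in percentage points.

    Assumption: each question is an independent pass/fail trial scored once.
    """
    standard_error = math.sqrt(accuracy * (1.0 - accuracy) / n_questions)
    return 100.0 * z * standard_error


GPQA_DIAMOND_QUESTIONS = 198  # size of the Diamond subset
REPORTED_GAP_POINTS = 0.7     # first vs. third place, from the article

# 0.935 is an assumed midpoint of the reported 93-94% range.
moe = margin_of_error(0.935, GPQA_DIAMOND_QUESTIONS)

print(f"95% margin of error: +/-{moe:.1f} points")
print(f"Gap of {REPORTED_GAP_POINTS} points is within noise: {REPORTED_GAP_POINTS < moe}")
```

Under these assumptions the 95% margin of error comes out at roughly ±3.4 points, several times the 0.7-point spread. The normal approximation is crude this close to 100%, and averaging over repeated runs would narrow the interval, so treat the figure as an order-of-magnitude check rather than a precise bound.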

Saturation changes which benchmarks matter for ranking. SWE-bench Pro (a harder variant with less public leakage) and Humanity's Last Exam (expert-written reasoning questions) now show real separation. On SWE-bench Pro, Opus 4.8 leads at 69.2% versus GPT-5.5 at 58.6% and Gemini 3.1 Pro at 54.2%, a 15-point gap between first and third place. On Humanity's Last Exam with tools, Opus 4.8 scores 57.9%, and the leaderboard spans a wider range, indicating that headroom remains. The field is resetting its benchmarks, with FrontierMath (Epoch AI) and SWE-bench Verified (GitHub issues) emerging as harder filtering tasks. Most dramatic: Claude Opus 4.8 hit 96.7% on USAMO 2026 (olympiad-level proofs), a 27.4-point jump from Opus 4.7's 69.3%, signaling a qualitative shift in proof-level mathematical reasoning.

The implication: GPQA-Diamond and other saturated benchmarks no longer serve as capability filters. When scores on a benchmark converge, differentiation moves elsewhere: away from high-level reasoning (which all frontier models now handle well) and toward applied task performance (coding at scale, multi-tool agentic workflows, long-context synthesis, alignment and honesty). Benchmark saturation is not failure; it is evidence of progress. It also means that model selection now rests on workload-specific evaluation rather than cross-domain reasoning comparisons.

For architects: if your evaluation relied on GPQA-Diamond or MMLU, refresh your benchmarking suite. Test against SWE-bench Pro (for coding), Humanity's Last Exam (for agentic reasoning with tools), and OSWorld or BrowserAgent evals (for real-world task completion). Watch Epoch AI's FrontierMath releases and Vals AI's domain-specific evaluations. Cost per correct output now matters more than percentage-point ranking on saturated benchmarks. Plan model selection around specific use cases, not general frontier leaderboards.
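One way to put cost per correct output into practice is to divide total spend on an internal eval run by the number of tasks the model actually solved. The sketch below assumes you already log per-run cost and pass counts; the `EvalRun` type, the model names, and every number in it are hypothetical placeholders, not measured results or real prices.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EvalRun:
    """One model's results on a workload-specific eval (hypothetical schema)."""
    model: str
    tasks_attempted: int
    tasks_correct: int
    total_cost_usd: float


def cost_per_correct(run: EvalRun) -> float:
    """Dollars spent per correctly completed task; infinite if nothing passed."""
    if run.tasks_correct == 0:
        return float("inf")
    return run.total_cost_usd / run.tasks_correct


def rank_by_cost_per_correct(runs: list[EvalRun]) -> list[EvalRun]:
    """Cheapest cost per correct output first."""
    return sorted(runs, key=cost_per_correct)


# Placeholder data for illustration only; substitute your own eval logs.
runs = [
    EvalRun("model-a", tasks_attempted=200, tasks_correct=138, total_cost_usd=96.60),
    EvalRun("model-b", tasks_attempted=200, tasks_correct=117, total_cost_usd=46.80),
    EvalRun("model-c", tasks_attempted=200, tasks_correct=0, total_cost_usd=10.00),
]

for run in rank_by_cost_per_correct(runs):
    accuracy = run.tasks_correct / run.tasks_attempted
    print(f"{run.model}: accuracy {accuracy:.1%}, ${cost_per_correct(run):.2f} per correct task")
```

With these placeholder figures, the less accurate `model-b` ranks ahead of `model-a` because each correct answer costs less. Whether that trade is acceptable depends on how expensive a wrong answer is for the workload, so a real selection process would likely pair this metric with a minimum accuracy threshold.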

Sources