Sakana Fugu Ultra: multi-agent orchestrator scores 95.5 GPQA, 73.7 SWE-Bench Pro, routes around export controls
Sakana AI, the Tokyo-based lab founded by David Ha and Llion Jones (a co-author of the Transformer paper), launched Fugu on June 22. Fugu is a multi-agent orchestration system: not a new base model, but a trained coordinator that routes queries across Claude Opus 4.8, GPT-5.5, and Gemini 3.1 Pro. Fugu Ultra, the flagship variant, dynamically delegates sub-tasks to whichever pool member handles them best, then synthesizes the outputs into a single response. The launch came 10 days after the US Commerce Department, on June 12, restricted international access to Anthropic's Fable 5 and Mythos Preview under export controls.
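The delegate-then-synthesize flow described above can be sketched in a few lines of Python. This is a toy illustration only: Sakana's coordinator is trained and its routing logic is not public, so the task labels, the `POOL` table, and the stub agents below are all hypothetical stand-ins, and the fixed lookup replaces what is a learned policy in the real system.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# An agent is anything that maps a prompt to a text answer.
# Real pool members would be vendor API calls; these are stubs.
Agent = Callable[[str], str]


@dataclass
class SubTask:
    kind: str    # illustrative label, e.g. "code", "science", "general"
    prompt: str


def make_stub(name: str) -> Agent:
    """Return a fake agent that tags its output with its own name."""
    return lambda prompt: f"[{name}] {prompt}"


# Hypothetical routing table (assumption): a static lookup, not a learned router.
POOL: Dict[str, Agent] = {
    "code": make_stub("agent_a"),
    "science": make_stub("agent_b"),
    "general": make_stub("agent_c"),
}


def route(task: SubTask) -> Agent:
    """Pick the agent for a sub-task, falling back to the general agent."""
    return POOL.get(task.kind, POOL["general"])


def orchestrate(subtasks: List[SubTask]) -> str:
    """Delegate each sub-task, then join the partial answers into one response."""
    partials = [route(task)(task.prompt) for task in subtasks]
    return "\n".join(partials)


if __name__ == "__main__":
    print(orchestrate([SubTask("code", "fix bug"), SubTask("poetry", "write haiku")]))
```

Because the pool is just a mapping from label to callable, swapping one member for another changes a single entry, which is the property the vendor lock-in argument relies on.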
On vendor-reported benchmarks (not yet independently reproduced), Fugu Ultra scores 95.5% on GPQA-Diamond (PhD-level science), 73.7% on SWE-Bench Pro (real GitHub issue resolution), and 82.1% on TerminalBench 2.1 (agentic terminal tasks). By Sakana's numbers, that puts it roughly level with the restricted models on reasoning and science, but 12.3 points behind Fable 5 on SWE-Bench Pro (73.7% vs 86%). Sakana's thesis is that coordinated ecosystems outperform isolated giants on hard, long-running tasks; if the scores hold up, the orchestration approach reaches frontier-level performance without deploying a monolithic frontier model.
Fugu is grounded in two ICLR 2026 papers: TRINITY (an evolved LLM coordinator) and Conductor (learning to orchestrate agents in natural language). The economic pitch is that Fugu Ultra ($5/$30 per million tokens, subscriptions from $20/month) is cheaper to operate than running frontier models independently for many workloads, and that its swappable agent pool hedges against vendor lock-in and future export restrictions. Early adopters report latencies of 20 to 30 minutes on complex tasks and quota exhaustion on base plans, two operational constraints to weigh against the price.
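To put the per-token price in concrete terms, here is a rough cost estimate. It assumes the $5/$30 figure follows the common convention of input rate / output rate, which the announcement does not spell out, and it ignores how tokens consumed internally by pool agents are metered. The function name and the token counts are illustrative.

```python
# Assumption: "$5/$30 per million tokens" means $5 input and $30 output.
INPUT_USD_PER_MTOK = 5.0
OUTPUT_USD_PER_MTOK = 30.0


def request_cost(input_tokens: int, output_tokens: int) -> float:
    """Estimated USD cost of one request at the assumed rates."""
    total_micro_usd = (
        input_tokens * INPUT_USD_PER_MTOK + output_tokens * OUTPUT_USD_PER_MTOK
    )
    return total_micro_usd / 1_000_000


if __name__ == "__main__":
    # Hypothetical long-running task: 200k tokens in, 50k tokens out.
    print(f"${request_cost(200_000, 50_000):.2f}")
```

At these assumed rates, the hypothetical 200k-in / 50k-out task costs $1.00 for input plus $1.50 for output, or $2.50 in total. A handful of such tasks per day would exceed the price of a $20/month subscription on a pay-per-token basis, which is one way to read the quota-exhaustion reports.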
For practitioners: Fugu Ultra is a different route to frontier-level capability in a multi-polar regulatory environment. The orchestration layer sidesteps training costs and avoids dependence on a single monolithic model. Three caveats apply, however: all published benchmarks are Sakana-reported and not independently verified; the SWE-Bench Pro gap to Fable 5 is meaningful for software engineering workflows; and the closed-source routing logic cannot be audited. Test on your own workload before relying on the published scores.
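A minimal sketch of what testing on your own workload could look like follows. Everything here is an assumption for illustration: `evaluate` is a hypothetical helper, `model` is any callable that wraps whichever API you use, and each case pairs a prompt with your own pass/fail check. It records latency alongside correctness, since the early reports suggest both matter.

```python
import time
from typing import Callable, Dict, List, Tuple

# A case is (prompt, checker); the checker decides whether an answer passes.
Case = Tuple[str, Callable[[str], bool]]


def evaluate(model: Callable[[str], str], cases: List[Case]) -> Dict[str, float]:
    """Run every case once and report pass rate and worst-case latency."""
    passed = 0
    latencies: List[float] = []
    for prompt, check in cases:
        start = time.perf_counter()
        answer = model(prompt)
        latencies.append(time.perf_counter() - start)
        if check(answer):
            passed += 1
    n = len(cases)
    return {
        "n": n,
        "pass_rate": passed / n if n else 0.0,
        "max_latency_s": max(latencies, default=0.0),
    }


if __name__ == "__main__":
    # Stand-in model for demonstration; replace with a real client call.
    fake_model = lambda prompt: prompt.upper()
    cases = [
        ("abc", lambda a: a == "ABC"),
        ("x", lambda a: a == "y"),
    ]
    print(evaluate(fake_model, cases))
```

Running the same cases against each candidate system gives a like-for-like comparison on the tasks you care about, which vendor-reported benchmark scores cannot provide.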