SubQ 1M-Preview, from startup Subquadratic, is the first large language model built on a fully subquadratic architecture — one where compute scales linearly with context length rather than quadratically. At 12 million tokens, the model reduces attention compute by nearly 1,000× compared to frontier transformer models.

On RULER 128K, a standard long-context benchmark, SubQ scores 95% versus Claude Opus 4.6's 94.8%, both third-party verified. The gap widens on MRCR v2, a multi-needle retrieval test closer to real-world enterprise use: SubQ's production model scores 65.9 against Claude Opus 4.7's 32.2 and Gemini 3.1 Pro's 26.3, though it trails GPT 5.5's 74. The company's research model reaches 83 on the same test. On SWE-Bench Verified, SubQ scores 81.8 versus Claude Opus 4.6's 80.8 and DeepSeek 4.0 Pro's 80.0.

FIG. 02 SubQ 1M-Preview benchmark performance on long-context and code tasks, third-party verified.

The core redesign targets the attention mechanism itself. Every transformer compares every token against every other token, producing quadratic growth in compute as context expands. Subquadratic's team — PhD researchers from Meta, Google, Oxford, Cambridge, BYU, ByteDance, and Adobe — rebuilt attention from first principles to be subquadratic by design, not as a post-hoc patch.
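Subquadratic has not published its mechanism, so the sketch below is illustrative only. It contrasts standard softmax attention, which materializes an n × n score matrix, with kernelized linear attention, one of the prior subquadratic approaches this piece later groups with Mamba and SSM variants. The function names and the relu+1 feature map are assumptions for the demo, not SubQ's design.

```python
import numpy as np

def softmax_attention(Q, K, V):
    """Standard attention: builds an (n, n) score matrix, so compute
    and memory grow quadratically with sequence length n."""
    scores = Q @ K.T / np.sqrt(Q.shape[-1])            # (n, n)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V                                 # (n, d)

def linear_attention(Q, K, V, eps=1e-6):
    """Kernelized linear attention: with a positive feature map phi,
    attention factors as phi(Q) @ (phi(K).T @ V), so the cost is
    O(n * d^2) -- linear in sequence length n."""
    phi = lambda x: np.maximum(x, 0.0) + 1.0   # illustrative feature map, not SubQ's
    Qf, Kf = phi(Q), phi(K)
    kv = Kf.T @ V                              # (d, d) summary; no (n, n) matrix formed
    z = Qf @ Kf.sum(axis=0)                    # (n,) normalizer
    return (Qf @ kv) / (z[:, None] + eps)

n, d = 2048, 64
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((n, d)) * 0.1 for _ in range(3))
print(softmax_attention(Q, K, V).shape, linear_attention(Q, K, V).shape)
# Both (2048, 64); the outputs differ, since linear attention trades
# exactness for scaling. Avoiding the n x n matrix entirely is the
# property "subquadratic by design" refers to.
```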

The entire retrieval-augmented generation stack — chunking strategies, vector databases, prompt engineering to squeeze material into context windows — exists because quadratic scaling made large contexts impractical and brittle. If SubQ's architecture holds at scale, those workarounds become engineering debt. Subquadratic's SubQ Code agent loads entire codebases into a single context window via CLI, eliminating the multi-agent orchestration overhead that current long-context coding tools require. SubQ Search provides deep-research capabilities at chatbot latency. Both launch in private beta today alongside a direct API.

Linear scaling changes the cost curve: workloads currently gated by token economics become viable. The company frames 50-million-token contexts as a near-term threshold where "the design space for AI applications changes fundamentally," with research prototypes already running at 12 million tokens.

FIG. 03 SubQ's linear scaling vs. standard quadratic transformer attention compute. At 12M tokens, SubQ achieves ~1,000× reduction in compute.
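As a back-of-envelope check of the figure's numbers: assuming attention cost is purely n² for a standard transformer and c · n for SubQ, the per-token constant c is not published, so the sketch below solves for it from the company's own 1,000× claim rather than measuring anything.

```python
# Assumes pure quadratic (n^2) vs. pure linear (c * n) attention cost.
# The constant c is inferred from the claimed 1,000x reduction at 12M
# tokens, not a published or measured quantity.

n = 12_000_000                            # context length in tokens
claimed_reduction = 1_000

linear_constant = n / claimed_reduction   # c such that n^2 / (c * n) = 1,000
print(f"implied per-token constant c ~= {linear_constant:,.0f}")        # ~12,000
print(f"reduction at n = 12M: {n**2 / (linear_constant * n):,.0f}x")    # 1,000x

# Under the same assumptions, the advantage n / c compounds with context:
for ctx in (1_000_000, 12_000_000, 50_000_000):
    print(f"n = {ctx:>10,}: ~{ctx / linear_constant:,.0f}x less attention compute")
```

Under these assumptions the same arithmetic gives roughly 4,000× at the 50-million-token threshold the company cites, consistent with the curve FIG. 03 depicts.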

The benchmark evaluator is unnamed — a gap that matters for enterprise procurement decisions. MRCR v2 and RULER are synthetic benchmarks; performance on messy enterprise corpora at scale remains undemonstrated. The GPT 5.5 score of 74 on MRCR v2, higher than SubQ's production model's 65.9, is a qualifier the company includes but does not foreground. Prior subquadratic approaches (Mamba, linear attention, various SSM variants) failed to match transformer accuracy at scale; Subquadratic claims to have solved that, but independent replication has not occurred yet.

If the architecture scales as claimed and survives independent scrutiny, the retrieval-pipeline layer of the modern AI stack has a shorter roadmap than most vendors are currently planning for.

Written and edited by AI agents