Mistral's 30B mixture-of-depths model remains unconfirmed but would fill a code-stack gap

Mistral has not released a 30B-parameter mixture-of-depths code model, and no announcement, API catalog entry, or weight repository confirms that one exists. The speculation is still worth noting because such a model would fill a logical gap in Mistral's code stack, which includes Codestral for autocomplete, Devstral 2 for agents, Devstral Small 2 for local inference, and the generalist Mistral Large 3.

The current lineup is architecturally diverse. Codestral is a 22B dense model optimized for fill-in-the-middle (FIM) completion. Its current 25.XX API release has a 256K context window and is priced at $0.30 per million input tokens and $0.90 per million output tokens on Mistral's dedicated endpoint. It achieves 95.3% pass@1 on FIM benchmarks and is cheap enough for per-keystroke calls. Devstral 2 is a 123B dense transformer designed for agentic coding, scoring 72.2% on SWE-Bench Verified, with API pricing at $0.40 per million input tokens and $2.00 per million output tokens. Devstral Small 2 is a 24B Apache 2.0 model that runs on a single RTX 4090 or a MacBook M-series Pro/Max for air-gapped work. Mistral Large 3 is a sparse mixture-of-experts model with 41B active parameters drawn from 675B total, also Apache 2.0, trained on approximately 3,000 H200 GPUs.
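The single-GPU claim for Devstral Small 2 can be sanity-checked with rough arithmetic. The sketch below estimates weight memory only, and several things in it are assumptions rather than facts from the sources: the quantization levels, the `headroom_gb` reserve for KV cache and activations, and the function names. The 24 GB figure is the RTX 4090's VRAM capacity.

```python
def weight_memory_gb(params_billion: float, bits_per_param: int) -> float:
    """Approximate size of the weights alone, in decimal gigabytes."""
    total_bytes = params_billion * 1e9 * bits_per_param / 8
    return total_bytes / 1e9


def fits_on_gpu(params_billion: float, bits_per_param: int,
                vram_gb: float = 24.0, headroom_gb: float = 4.0) -> bool:
    """True if the weights fit while leaving `headroom_gb` free.

    headroom_gb is a hypothetical reserve for KV cache and activations;
    real requirements depend on context length and the runtime used.
    """
    return weight_memory_gb(params_billion, bits_per_param) <= vram_gb - headroom_gb


if __name__ == "__main__":
    for bits in (16, 8, 4):
        size = weight_memory_gb(24, bits)
        print(f"24B at {bits}-bit: {size:.0f} GB, fits on 24 GB card: {fits_on_gpu(24, bits)}")
```

Under these assumptions, a 24B model only fits on a 24 GB card when its weights are quantized well below 16-bit, which suggests the local deployment relies on a quantized checkpoint.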

A hypothetical 30B mixture-of-depths model would sit between Codestral and Large 3, relying on per-token routing across depth rather than sparse expert selection. MoE dispatches each token to a subset of feed-forward experts. Mixture-of-depths, as described in the published technique, places a learned router in front of selected transformer blocks: only a fixed fraction of tokens is processed by each routed block, and the remaining tokens bypass it through the residual connection. This reduces the average forward-pass cost below that of a dense 30B model while keeping full depth available for the tokens the routers select. It also adds operational complexity. Tokens in the same batch take different paths through the stack, so the KV cache holds a different number of entries at each routed layer, and throughput benchmarks from standard vLLM or Triton stacks do not transfer directly. The savings are realized only if the inference engine actually skips the bypassed tokens instead of padding every block to the full batch.
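The routing idea can be illustrated with a toy example. This is a minimal sketch of a single routed block in plain Python, not Mistral's code and not a reference implementation: the function name `mod_block`, the per-token `block_fn`, and the list-of-lists token representation are all hypothetical simplifications (a real block would also run attention across the selected tokens).

```python
def mod_block(x, router_w, block_fn, capacity):
    """Apply one routed block to a sequence of token vectors.

    x: list of token vectors (lists of floats).
    router_w: router weight vector, same length as a token vector.
    block_fn: toy stand-in for the block's computation on one token.
    capacity: how many tokens the block is allowed to process.
    """
    if not 1 <= capacity <= len(x):
        raise ValueError("capacity must be between 1 and the sequence length")

    # One scalar router score per token (dot product with the router weights).
    scores = [sum(a * b for a, b in zip(tok, router_w)) for tok in x]

    # Keep the `capacity` highest-scoring tokens, restored to sequence order.
    ranked = sorted(range(len(x)), key=lambda i: scores[i], reverse=True)
    selected = sorted(ranked[:capacity])

    # Default path: every token skips the block via the residual connection.
    out = [list(tok) for tok in x]

    # Selected tokens get residual + router-weighted block output.
    for i in selected:
        update = block_fn(x[i])
        out[i] = [xi + scores[i] * ui for xi, ui in zip(x[i], update)]
    return out, selected


if __name__ == "__main__":
    tokens = [[1.0, 0.0], [0.0, 1.0], [3.0, 0.0], [0.5, 0.5]]
    out, selected = mod_block(tokens, [1.0, 0.0], lambda t: [2.0 * v for v in t], capacity=2)
    print("tokens processed by the block:", selected)
    print("fraction of tokens computed:", len(selected) / len(tokens))
```

Because `capacity` is fixed in this sketch, the amount of work per block is known in advance. What varies is which tokens receive it, and that is the part a serving stack has to track per layer.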

FIG. 02 Mistral's code model stack: confirmed dense and sparse tiers with hypothetical 30B mixture-of-depths in the gap.

Because Mistral has published no weights, endpoint, or evaluations for such a model, the 30B MoD claim remains speculative. The existing family already illustrates the trade-offs architects face. Codestral excels in latency and price but lacks the reasoning depth for multi-file refactoring. Devstral 2 handles that work at roughly 1.3× the input price and 2.2× the output price of Codestral at the listed rates, so the premium grows with output length. Devstral Small 2 offers offline inference at 24B-scale accuracy. No confirmed MoD option yet provides variable compute cost at code-model quality.
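The price gap between the two hosted tiers depends on the mix of input and output tokens. The sketch below uses only the per-million-token prices quoted above; the `Price` class, the function names, and the example token counts are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Price:
    input_per_m: float   # USD per million input tokens
    output_per_m: float  # USD per million output tokens


# Prices as quoted in this article.
CODESTRAL = Price(input_per_m=0.30, output_per_m=0.90)
DEVSTRAL_2 = Price(input_per_m=0.40, output_per_m=2.00)


def request_cost(price: Price, input_tokens: int, output_tokens: int) -> float:
    """Cost in USD of one request at the given token counts."""
    return (input_tokens * price.input_per_m
            + output_tokens * price.output_per_m) / 1_000_000


def devstral_to_codestral_ratio(input_tokens: int, output_tokens: int) -> float:
    """How many times more the same request costs on Devstral 2."""
    return (request_cost(DEVSTRAL_2, input_tokens, output_tokens)
            / request_cost(CODESTRAL, input_tokens, output_tokens))


if __name__ == "__main__":
    # Hypothetical token mixes, from input-heavy to output-heavy.
    for inp, out in [(10_000, 100), (10_000, 2_000), (2_000, 10_000)]:
        print(f"{inp} in / {out} out: {devstral_to_codestral_ratio(inp, out):.2f}x")
```

For identical token counts the ratio stays between about 1.33 (input only) and about 2.22 (output only). An agentic task will usually cost more than that in total because it consumes more tokens per task, not because each token is priced higher.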

Adopt the tiering strategy now, not the unconfirmed routing mechanism: use a cheap 22B endpoint for autocomplete, route complex agentic tasks to a 123B API, and maintain a 24B Apache 2.0 checkpoint on local hardware for pre-commit or offline generation. If a MoD model emerges, the key question will be whether its dynamic compute savings offset the overhead of custom CUDA kernels and uneven batch execution.
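The tiering strategy can be expressed as a small dispatch rule. This is a sketch under stated assumptions: the task labels (`"fim"`, `"agent"`, `"refactor"`, `"pre_commit"`), the `Tier` names, and the `offline` flag are all hypothetical, and a real setup would map each tier to whatever endpoint or local runtime the team uses.

```python
from enum import Enum


class Tier(Enum):
    AUTOCOMPLETE = "22B autocomplete endpoint"
    AGENT = "123B agentic API"
    LOCAL = "24B local checkpoint"


def pick_tier(task: str, offline: bool = False) -> Tier:
    """Choose a model tier for a task (hypothetical labels and policy)."""
    if offline:
        # No network access: only the local checkpoint is available.
        return Tier.LOCAL
    if task == "fim":
        # Per-keystroke completion goes to the cheapest, fastest tier.
        return Tier.AUTOCOMPLETE
    if task in {"agent", "refactor"}:
        # Multi-file or multi-step work goes to the large agentic model.
        return Tier.AGENT
    # Everything else (for example pre-commit generation) stays local.
    return Tier.LOCAL


if __name__ == "__main__":
    for task in ("fim", "refactor", "pre_commit"):
        print(task, "->", pick_tier(task).value)
```

Keeping the policy in one function means a future MoD endpoint, if one ever ships, could be added as another tier without touching the callers.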

Sources

Mistral's news page shows no mixture-of-depths code model announcement — only Mistral Medium 3.5 agent features as of April 2026
"Remote agents in Vibe. Powered by Mistral Medium 3.5."
mistral.ai ↗
Codestral is a 22B open-weight model trained on 80+ programming languages with fill-in-the-middle capability; original May 2024 launch carried a 32K context window
"With its larger context window of 32k (compared to 4k, 8k or 16k for competitors), Codestral outperforms all other models in RepoBench"
mistral.ai ↗
Codestral 25.XX series carries a 256K context window; 95.3% pass@1 on FIM benchmarks
"Context Window: 256k tokens (standard across 25.XX series)"
devradar-dev.github.io ↗
Devstral Small 2 (24B, Apache 2.0) runs on a single RTX 4090 or MacBook M-series Pro/Max for air-gapped work
"Runs on single consumer GPU (RTX 4090) or high-end MacBook"
devradar-dev.github.io ↗
Codestral (current 25.XX API version) priced at $0.30/M input, $0.90/M output tokens
"Mistral's cutting-edge language model for coding released end of July 2025. Codestral specializes in low-latency, high-frequency tasks such as fill-in-the-middle (FIM), code correction and test generation."
openrouter.ai ↗
Devstral 2 is a 123B-parameter dense transformer scoring 72.2% on SWE-Bench Verified at $0.40/$2.00 per million tokens
"Devstral 2 is a state-of-the-art open-source model by Mistral AI specializing in agentic coding. It is a 123B-parameter dense transformer model supporting a 256K context window."
openrouter.ai ↗
Mistral Large 3 is a sparse MoE with 41B active parameters out of 675B total, trained on ~3,000 H200 GPUs
"It is a sparse mixture-of-experts (MoE) model featuring 41 billion active parameters and a total of 675 billion parameters, trained from scratch on an exascale NVIDIA GPU cluster (about 3,000 H200 GPUs)"
intuitionlabs.ai ↗
Mixture-of-depths technique lets tokens skip individual transformer blocks under a fixed compute budget, so each forward pass uses a fraction of the FLOPs; can be 50%+ faster to step during post-training sampling
"These models match baseline performance for equivalent FLOPS and wall-clock times to train, but require a fraction of the FLOPs per forward pass, and can be upwards of 50% faster to step during post-training sampling."
github.com ↗
