Mistral has not released a 30B parameter mixture-of-depths code model, and no announcement, API catalog entry, or weight repository confirms that one exists. The speculation is noteworthy because such a model would fill a logical gap in Mistral's production-hardened code stack, which includes Codestral for autocomplete, Devstral 2 for agents, Devstral Small 2 for local inference, and the generalist Mistral Large 3.
The current lineup is architecturally diverse. Codestral is a 22B dense model optimized for fill-in-the-middle (FIM) completion, featuring a 256K context window in its current 25.XX API release, priced at $0.30 per million input tokens and $0.90 per million output tokens on Mistral's dedicated endpoint. It achieves 95.3% pass@1 on FIM benchmarks and is cheap enough for per-keystroke calls. Devstral 2 is a 123B dense transformer designed for agentic coding, scoring 72.2% on SWE-Bench Verified, with API pricing at $0.40 per million input tokens and $2.00 per million output tokens. Devstral Small 2 is a 24B Apache 2.0 model that runs on a single RTX 4090 or a MacBook M-series Pro/Max for air-gapped work. Mistral Large 3 is a sparse mixture-of-experts model with 41B active parameters drawn from 675B total, also Apache 2.0, trained on approximately 3,000 H200 GPUs.
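The pricing above can be turned into a per-request estimate. The following is a minimal sketch that uses only the list prices quoted in this article; the dictionary keys are illustrative labels rather than confirmed API model identifiers, and the token counts in the usage lines are hypothetical.

```python
# List prices quoted in the text, in USD per million tokens: (input, output).
# The keys are illustrative labels, not confirmed API model IDs.
PRICES = {
    "codestral": (0.30, 0.90),
    "devstral-2": (0.40, 2.00),
}


def request_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Return the USD cost of one request at the quoted list prices."""
    price_in, price_out = PRICES[model]
    return (input_tokens * price_in + output_tokens * price_out) / 1_000_000


def cost_ratio(input_tokens: int, output_tokens: int) -> float:
    """Devstral 2 cost divided by Codestral cost for the same token counts."""
    return (request_cost("devstral-2", input_tokens, output_tokens)
            / request_cost("codestral", input_tokens, output_tokens))


if __name__ == "__main__":
    # Hypothetical autocomplete call: 2,000 context tokens, 50 completed tokens.
    print(f"autocomplete call: ${request_cost('codestral', 2_000, 50):.6f}")
    # Hypothetical agent step: 20,000 tokens in, 2,000 tokens out.
    print(f"agent step:        ${request_cost('devstral-2', 20_000, 2_000):.6f}")
    print(f"price ratio:       {cost_ratio(20_000, 2_000):.2f}x")
```

For identical token counts the ratio is bounded by the input-price ratio (0.40/0.30) and the output-price ratio (2.00/0.90), so the per-token gap is driven mostly by how output-heavy the workload is.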
A hypothetical 30B active-parameter mixture-of-depths model would sit between Codestral's 22B and Large 3's 41B active parameters, relying on per-token depth decisions rather than sparse expert selection. Unlike MoE, which dispatches each token to a subset of feed-forward experts, mixture-of-depths varies how many layers a token passes through; in the early-exit form assumed here, a token skips the later layers once an intermediate confidence threshold is met. This reduces the average forward-pass cost below that of a dense 30B model while keeping full capacity available for complex tokens. However, it introduces operational complexity: dynamic depth routing disrupts static batching, KV-cache sizing, and throughput benchmarking on standard vLLM or Triton stacks, because each token in a batch may exit at a different layer. Compute and memory-bandwidth savings are realized only if the inference engine can handle early exits without padding the entire batch to full depth.
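The batching problem can be illustrated with a toy simulation. This is a sketch under stated assumptions, not a model of any real inference engine: the 40-layer depth, the batch size, and the fixed per-layer exit probability (a stand-in for a confidence check) are all hypothetical. It compares an engine that stops each token at its own exit layer with one that keeps the whole batch running until its deepest token exits.

```python
import random


def simulate_batch(num_layers: int, batch_size: int,
                   exit_prob: float, rng: random.Random) -> list[int]:
    """Return the exit depth of each token in one batch.

    After every layer a token exits with probability exit_prob; a token
    that never exits early runs all num_layers layers.
    """
    depths = []
    for _ in range(batch_size):
        depth = num_layers
        for layer in range(1, num_layers + 1):
            if rng.random() < exit_prob:
                depth = layer
                break
        depths.append(depth)
    return depths


def layer_evals(depths: list[int], padded: bool) -> int:
    """Count token-layer evaluations for a batch.

    padded=True:  every token is carried to the deepest exit in the batch.
    padded=False: each token stops at its own exit depth.
    """
    if padded:
        return max(depths) * len(depths)
    return sum(depths)


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so runs are repeatable
    depths = simulate_batch(num_layers=40, batch_size=32, exit_prob=0.05, rng=rng)
    dense = 40 * len(depths)  # cost of running every token through every layer
    print("per-token exits, fraction of dense:", layer_evals(depths, False) / dense)
    print("padded batch, fraction of dense:   ", layer_evals(depths, True) / dense)
```

In this toy setup the padded figure is set by the slowest token in the batch, so a single token that runs to full depth erases the savings for every other token in that batch.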
Until Mistral publishes weights, an endpoint, or evaluations for such a model, the 30B MoD claim remains speculative. The existing family already illustrates the trade-offs architects face. Codestral excels in latency and price but lacks the reasoning depth for multi-file refactoring. Devstral 2 handles that work at roughly 1.3× (input) to 2.2× (output) the per-token price of Codestral at the list prices above, so the gap widens as responses get longer. Devstral Small 2 offers offline inference with 24B-scale accuracy. No confirmed MoD option yet provides variable compute cost at code-model quality.
Adopt the tiering strategy now, not the unconfirmed routing mechanism: use a cheap 22B endpoint for autocomplete, route complex agentic tasks to a 123B API, and maintain a 24B Apache 2.0 checkpoint on local hardware for pre-commit or offline generation. If a MoD model emerges, the key question will be whether its dynamic compute savings offset the overhead of custom CUDA kernels and uneven batch execution.
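The tiering strategy can be sketched as a small dispatch function. Everything here is an illustrative assumption: the `Task` fields, the task kinds, and the routing thresholds are hypothetical and would need tuning against a real workload, and the tier values are descriptive strings rather than endpoint names.

```python
from dataclasses import dataclass
from enum import Enum


class Tier(Enum):
    AUTOCOMPLETE = "22B hosted endpoint"
    AGENT = "123B hosted API"
    LOCAL = "24B local checkpoint"


@dataclass(frozen=True)
class Task:
    kind: str               # hypothetical labels: "completion", "refactor", "precommit"
    files_touched: int = 1  # how many files the change spans
    offline: bool = False   # True when no network access is allowed


def pick_tier(task: Task) -> Tier:
    """Choose a model tier for a task using simple, hand-written rules."""
    # Offline and pre-commit work stays on the local checkpoint.
    if task.offline or task.kind == "precommit":
        return Tier.LOCAL
    # Single-file completions go to the cheap, low-latency endpoint.
    if task.kind == "completion" and task.files_touched <= 1:
        return Tier.AUTOCOMPLETE
    # Everything else (multi-file refactors, agent loops) goes to the large model.
    return Tier.AGENT


if __name__ == "__main__":
    for task in (Task("completion"),
                 Task("refactor", files_touched=5),
                 Task("precommit"),
                 Task("completion", offline=True)):
        print(task, "->", pick_tier(task).value)
```

Keeping the routing rule outside the model clients means a future MoD endpoint, if one appears, could be added as another tier without changing the callers.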