Cohere's North Mini Code Tops Rivals on Single GPU

Cohere introduced North Mini Code on June 9, an Apache 2.0 open-weights model scoring 33.4 on the Artificial Analysis Coding Index, surpassing Qwen3.5 35B-A3B, Mistral Small 4, and Devstral 2, while fitting on a single H100 or Mac Studio at approximately 20 GB via MLX. This marks the first model in Cohere's next-generation family, designed for self-hosted and embedded inference with unrestricted weights, less than three weeks after the release of Command A+ on May 20.

The 30B-parameter, decoder-only sparse Mixture-of-Experts architecture features SwiGLU FFN blocks, 128 experts, and 8 activated per token through a sigmoid router. Compute per forward pass is equivalent to a 3B-dense model, while memory footprint remains at 30B. Cohere combines sliding-window and global attention in a 3:1 ratio with RoPE, omits positional embeddings, and adds a single dense Transformer layer before sparse layers for early signal stabilization. The model supports a 256K context window and 64K max output length. Post-training involves a two-stage cascaded SFT pipeline, followed by RLVR across over 70,000 verifiable environments from around 5,000 deduplicated repositories. The RL stack uses a vLLM sidecar for continuous rollout sampling, exports policy weights every K=4 learner steps, and employs a windowed FIFO queue to prevent trainer stalls. Training against SWE-Agent, mini-SWE-agent, and OpenCode improved OpenCode eval scores by 10 percentage points without affecting SWE-Agent performance; after SFT, the model achieves 80.2 percent pass@10 on SWE-Bench Verified and 55.1 percent on Terminal-Bench v2.

Operationally, Cohere's API outputs approximately 199 tokens per second with a 0.25-second time-to-first-token, compared to a 1.95-second class median, according to Artificial Analysis. Internal tests on identical hardware show 2.8× higher output throughput and 30 percent lower inter-token latency than Devstral Small 2. Weights are available on Hugging Face in BF16 and FP8, with managed endpoints through Cohere API, OpenCode, and Model Vault for VPC or on-prem.

However, North Mini Code's verbosity is a trade-off, generating 75 million output tokens to complete the Intelligence Index suite, three times the class median of 25 million. This verbosity results in higher per-request costs and longer wall-clock latency at scale. The model's capability drops sharply outside coding, scoring 14 percent on GDPval-AA and 37 percent on τ²-Bench Telecom, with an overall Agentic Index of 21.7, making it unsuitable as a general-purpose agent backbone. Architects must also provision VRAM and memory bandwidth for a 30B dense model, as the total parameter count governs serving footprint.

Sources

North Mini Code is a 30B-parameter sparse MoE with 128 experts, 8 active per token, 256K context, 64K max output, Apache 2.0
"a 30B-parameter Mixture-of-Experts model with 3B active parameters with powerful agentic coding capabilities, available on Hugging Face under the Apache 2.0 license"
huggingface.co ↗
Coding Index score of 33.4, outperforming Qwen3.5 35B-A3B, Gemma 4 26B, Devstral Small 2, and models up to 123B
"On Artificial Analysis' Coding Index, North Mini Code achieves a score of 33.4, outperforming Qwen3.5 (35B-A3B), Gemma 4 (26B-A4B), Devstral Small 2 (24B Dense), and even substantially larger models such as Nemotron 3 Super (120B-A12B), Mistral Small 4 (119B-A6B), and Devstral 2 (123B)"
huggingface.co ↗
SWE-Bench Verified 80.2% pass@10 and Terminal-Bench v2 55.1% pass@10 after SFT
"The final SFT model achieves 80.2% pass@10 on SWE-Bench Verified and 55.1% pass@10 on Terminal-Bench v2"
huggingface.co ↗
RLVR training used 70,000+ verifiable tasks across ~5,000 repositories, deduplicated against SWE-Bench
"we used over 70k verifiable tasks across ~5k unique repositories. We deduplicate our environments against the repository sources from SWE-Bench and SWE-Bench-Pro"
huggingface.co ↗
Multi-harness training yielded a 10 percentage-point gain on OpenCode evaluation while maintaining SWE-Agent performance
"Cohere reports a 10 percentage point gain on OpenCode evaluation from the multi-harness approach while maintaining SWE-Agent performance"
huggingface.co ↗
Intelligence Index score of 27.6; non-coding agentic scores: 14% GDPval-AA, 37% τ²-Bench Telecom, overall Agentic Index 21.7
"Achieves 27.6 on the Artificial Analysis Intelligence Index... it underperforms on non-coding agentic tasks, scoring 14% on GDPval-AA and 37% on τ²-Bench Telecom"
artificialanalysis.ai ↗
Generated 75 million output tokens to complete the Intelligence Index suite vs. a class median of 25 million (3× more verbose)
"the model generated 75 million output tokens to complete the Intelligence Index against a class median of 25 million. In high-volume agentic pipelines, that verbosity compounds into inference cost and latency."
venturebeat.com ↗
~199 output tokens/second on Cohere API; 0.25s TTFT vs. 1.95s class median per Artificial Analysis
"Artificial Analysis independently ranks it eighth of 127 comparable open-weight models on output speed at 210 tokens per second, with a time to first token of 0.25 second against a class median of 1.95 seconds."
venturebeat.com ↗
~199 output tokens/second on Cohere's API, faster than comparable open-weight models in its size class
"On Cohere's API, North Mini Code is faster than several comparable open weights models of its intelligence and size class (~199 output tokens per second)"
artificialanalysis.ai ↗
2.8× higher output throughput and 30% lower inter-token latency vs. Devstral Small 2 in internal tests on identical hardware
"Cohere claims 2.8x higher output throughput and a 30% inter-token latency advantage over Devstral Small 2 in internal tests under identical hardware configurations"
venturebeat.com ↗
Nick Frosst demoed it running on a Mac Studio via MLX at ~20 GB RAM
"Nick Frosst, co-founder of Cohere, demoed it running on a Mac Studio via MLX at around 20 gigabytes of RAM, the same machine he uses for his own local coding work."
venturebeat.com ↗
North Mini Code launched less than three weeks after Command A+ (May 20), which scored 37 on the Artificial Analysis Intelligence Index
"The release comes less than three weeks after Cohere launched Command A+, its previous model, on May 20. Command A+ received a score of 37 on the Artificial Analysis Intelligence Index."
cryptobriefing.com ↗
Command A+ released May 20, 2026 — first model in the Command A family with MoE architecture, Apache 2.0
"Today, we're releasing Command A+ open-source. A mixture-of-experts (MoE) model, Command A+ is an efficient, versatile, and privately deployable LLM built for high-performance agentic tasks"
cohere.com ↗
Available on Hugging Face (BF16/FP8), Cohere API, OpenCode, and Model Vault for VPC/on-prem deployment
"Download the weights on Hugging Face, or deploy in a dedicated, managed inference environment on Model Vault. Alternatively, try it for free in your harness of choice on OpenCode or with a Cohere API key."
cohere.com ↗

Written and edited by AI agents · Methodology

Cohere's North Mini Code Tops Rivals on Single GPU

Get the signal before the noise.

Get the signal before the noise.