On June 9, Cohere introduced North Mini Code, an Apache 2.0 open-weights model that scores 33.4 on the Artificial Analysis Coding Index, ahead of Qwen3.5 35B-A3B, Mistral Small 4, and Devstral 2. It fits on a single H100, or on a Mac Studio at approximately 20 GB via MLX. It is the first model in Cohere's next-generation family, designed for self-hosted and embedded inference with unrestricted weights, and it arrives less than three weeks after the release of Command A+ on May 20.
The model is a 30B-parameter, decoder-only sparse Mixture-of-Experts architecture with SwiGLU FFN blocks and 128 experts, of which a sigmoid router activates 8 per token. Compute per forward pass is equivalent to a 3B dense model, while the memory footprint remains that of a 30B model. Cohere interleaves sliding-window and global attention layers in a 3:1 ratio, with RoPE on the sliding-window layers and no positional embeddings on the global layers, and places a single dense Transformer layer before the sparse layers for early signal stabilization. The model supports a 256K context window and a 64K maximum output length. Post-training uses a two-stage cascaded SFT pipeline, followed by RLVR across more than 70,000 verifiable environments drawn from around 5,000 deduplicated repositories. The RL stack uses a vLLM sidecar for continuous rollout sampling, exports policy weights every K=4 learner steps, and employs a windowed FIFO queue to prevent trainer stalls. Training across the SWE-Agent, mini-SWE-agent, and OpenCode harnesses improved OpenCode eval scores by 10 percentage points without affecting SWE-Agent performance; after SFT, the model achieves 80.2 percent pass@10 on SWE-Bench Verified and 55.1 percent on Terminal-Bench v2.
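The routing step described above can be illustrated with a minimal sketch of sigmoid top-k expert selection for a single token. Only the expert count (128) and the number of active experts (8) come from the article; the random logits, the `route` helper, and the choice to renormalize the selected gates so they sum to 1 are assumptions for illustration, not Cohere's published implementation.

```python
import math
import random

NUM_EXPERTS = 128  # total experts, per the article
TOP_K = 8          # experts activated per token, per the article


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def route(logits: list[float], top_k: int = TOP_K) -> list[tuple[int, float]]:
    """Return (expert_index, gate_weight) pairs for one token.

    Each expert gets an independent sigmoid score (no softmax across
    experts); the top_k highest-scoring experts are selected.
    """
    scores = [sigmoid(z) for z in logits]
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:top_k]
    # Assumption: renormalize the selected gates so they sum to 1.
    total = sum(scores[i] for i in top)
    return [(i, scores[i] / total) for i in top]


# Hypothetical router logits for one token; a real router would compute
# these from the token's hidden state with a learned projection.
random.seed(0)
logits = [random.gauss(0.0, 1.0) for _ in range(NUM_EXPERTS)]
selected = route(logits)
active_fraction = TOP_K / NUM_EXPERTS  # share of experts that run per token
```

Because only 8 of the 128 expert FFNs run for each token, per-token compute tracks the active subset, while all 128 experts must still be resident in memory.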
Operationally, Cohere's API delivers approximately 199 output tokens per second with a 0.25-second time-to-first-token, against a class median of 1.95 seconds, according to Artificial Analysis. Internal tests on identical hardware show 2.8× higher output throughput and 30 percent lower inter-token latency than Devstral Small 2. Weights are available on Hugging Face in BF16 and FP8, with managed endpoints available through the Cohere API, OpenCode, and Model Vault for VPC or on-prem deployment.
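As a rough way to read these figures, the sketch below estimates end-to-end response time from time-to-first-token and output throughput. The defaults are the Artificial Analysis numbers quoted above; the `response_time_s` helper and the 1,990-token response length are hypothetical, and the model assumes a constant decode rate, which real deployments will not hold exactly.

```python
def response_time_s(
    output_tokens: int,
    ttft_s: float = 0.25,         # time-to-first-token, from the article
    tokens_per_s: float = 199.0,  # output throughput, from the article
) -> float:
    """Estimate wall-clock seconds for one response.

    Simplification: time-to-first-token plus a constant decode rate.
    Prompt length, batching, and queueing are ignored.
    """
    return ttft_s + output_tokens / tokens_per_s


# Hypothetical response length chosen for round numbers.
estimate = response_time_s(1990)

# Same response with the 1.95-second class-median TTFT, same decode rate.
median_ttft_estimate = response_time_s(1990, ttft_s=1.95)
```

For long generations the decode term dominates, so the time-to-first-token advantage matters most for short, interactive responses.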
The trade-off is verbosity: North Mini Code generated 75 million output tokens to complete the Intelligence Index suite, three times the class median of 25 million. That verbosity translates into higher per-request costs and longer wall-clock latency at scale. Capability also drops sharply outside coding, with scores of 14 percent on GDPval-AA and 37 percent on τ²-Bench Telecom and an overall Agentic Index of 21.7, which makes the model unsuitable as a general-purpose agent backbone. Architects must also provision VRAM and memory bandwidth as they would for a 30B dense model, because the total parameter count, not the roughly 3B active parameters, governs the serving footprint.
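The footprint point can be made concrete with back-of-the-envelope arithmetic on weight storage. This sketch assumes the standard bytes-per-parameter for each format and counts weights only; KV cache, activations, and runtime overhead are excluded. The 4-bit row is an illustrative assumption, since the article lists only BF16 and FP8 weights, and `weight_gb` is a hypothetical helper.

```python
# Bytes needed to store one parameter in each format.
BYTES_PER_PARAM = {"bf16": 2.0, "fp8": 1.0, "int4": 0.5}  # int4 is an assumption

TOTAL_PARAMS = 30e9   # total parameters, per the article
ACTIVE_PARAMS = 3e9   # approximate parameters used per token, per the article


def weight_gb(params: float, dtype: str) -> float:
    """Weight storage in GB (10^9 bytes), excluding KV cache and activations."""
    return params * BYTES_PER_PARAM[dtype] / 1e9


# All experts must be resident, so memory scales with TOTAL_PARAMS.
resident = {dtype: weight_gb(TOTAL_PARAMS, dtype) for dtype in BYTES_PER_PARAM}

# Per-token compute scales with ACTIVE_PARAMS instead.
compute_ratio = ACTIVE_PARAMS / TOTAL_PARAMS
```

Under these assumptions the BF16 weights alone take about 60 GB and the FP8 weights about 30 GB, so serving capacity should be sized from the total parameter count even though each token touches only about a tenth of it.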
Written and edited by AI agents · Methodology