Alibaba's Qwen team released Qwen3.6-27B, a dense 27-billion-parameter model that scores 77.2% on SWE-bench Verified, outperforming the 397-billion-parameter Qwen3.5-397B-A17B (76.2%) while weighing 55.6 GB against that model's 807 GB. A Q4_K_M quantized build cuts the footprint to 16.8 GB, putting that benchmark performance within reach of a single consumer GPU.
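Those file sizes line up with straightforward back-of-envelope arithmetic: about 2 bytes per weight at BF16 and, as an assumption here, a typical ~4.85 effective bits per weight for a Q4_K_M mix. The sketch below is a rough sanity check, not a published breakdown; the exact numbers depend on embeddings, metadata, and which tensors stay at higher precision.

```python
# Back-of-envelope file-size check for a ~27B-parameter dense model.
# The bits-per-weight figures are typical values, not Qwen-published numbers.

PARAMS = 27e9          # nominal parameter count
BF16_BITS = 16         # bfloat16: 2 bytes per weight
Q4_K_M_BITS = 4.85     # assumed effective bits/weight for a Q4_K_M GGUF mix

def size_gb(params: float, bits_per_weight: float) -> float:
    """Approximate on-disk size in decimal gigabytes."""
    return params * bits_per_weight / 8 / 1e9

print(f"BF16   ~{size_gb(PARAMS, BF16_BITS):.1f} GB   (reported: 55.6 GB)")
print(f"Q4_K_M ~{size_gb(PARAMS, Q4_K_M_BITS):.1f} GB   (reported: 16.8 GB)")
# Prints ~54.0 GB and ~16.4 GB; the small gap to the reported figures is
# plausibly embeddings, metadata, and tensors kept at higher precision.
```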
The gains trace to a Gated DeltaNet hybrid architecture. The model's 64 layers follow a repeating pattern: three Gated DeltaNet → FFN blocks, then one Gated Attention → FFN block. The Gated DeltaNet layers use 48 heads for values and 16 for queries and keys; the Gated Attention layers use 24 query heads and 4 key-value heads via grouped-query attention. The 3:1 ratio of linear-attention to standard-attention layers reduces memory-bandwidth pressure at long contexts, while the standard-attention layers anchor precise retrieval at fixed intervals.
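Laid out concretely, that schedule gives 48 linear-attention layers and 16 full-attention layers. A minimal sketch, with layer-type names that are illustrative rather than the released checkpoint's config keys:

```python
# Sketch of the 64-layer hybrid schedule described above: each block of four
# layers is three Gated DeltaNet layers followed by one Gated Attention layer.
# Layer-type names are illustrative, not the actual config keys.

NUM_LAYERS = 64

def layer_type(idx: int) -> str:
    # Positions 0-2 in each block of 4 -> linear attention (Gated DeltaNet);
    # position 3 -> standard grouped-query attention (Gated Attention).
    return "gated_attention" if idx % 4 == 3 else "gated_deltanet"

schedule = [layer_type(i) for i in range(NUM_LAYERS)]

print(schedule[:8])
# ['gated_deltanet', 'gated_deltanet', 'gated_deltanet', 'gated_attention', ...]
print(schedule.count("gated_deltanet"), schedule.count("gated_attention"))
# 48 16 -- only the 16 standard-attention layers accumulate a conventional
# KV cache, which is where the long-context bandwidth savings come from.
```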
Native context length is 262,144 tokens, extensible to 1,010,000 tokens. The model ships with Thinking Preservation: chain-of-thought reasoning is kept across conversation turns rather than discarded after each response, cutting redundant recomputation in iterative coding workflows where agents track state across extended sessions.
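The practical difference shows up in how conversation history is assembled between turns: a conventional pipeline strips the model's reasoning before the next request, while a preservation-style pipeline keeps it. The sketch below assumes a separate reasoning_content field on assistant messages, an illustrative convention rather than Qwen's documented chat-template format.

```python
# Illustrative contrast between discarding and preserving chain-of-thought
# across turns. The `reasoning_content` field is an assumed convention here,
# not necessarily the field name used by Qwen's chat template.

history = [
    {"role": "user", "content": "Fix the failing test in utils_test.py."},
    {
        "role": "assistant",
        "reasoning_content": "The test fails because parse() drops empty lines...",
        "content": "Patched parse() to keep empty lines; the test now passes.",
    },
    {"role": "user", "content": "Now apply the same fix to the YAML loader."},
]

def build_messages(history: list[dict], preserve_thinking: bool) -> list[dict]:
    """Assemble the message list for the next request."""
    messages = []
    for msg in history:
        msg = dict(msg)
        if not preserve_thinking:
            msg.pop("reasoning_content", None)  # conventional behavior: drop CoT
        messages.append(msg)
    return messages

# With preservation, the earlier diagnosis of parse() travels into turn two,
# so the model does not have to re-derive it from the code.
```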
The benchmark advantage extends beyond SWE-bench Verified. Qwen3.6-27B posts 53.5% on SWE-bench Pro versus 50.9% for the 397B predecessor, 59.3% on Terminal-Bench 2.0 versus 52.5%, and 48.2% on SkillsBench Avg versus 30.0% — the largest single gap in the comparison. LiveCodeBench v6 scores 83.9% (vs. 83.6%). On GPQA Diamond the model scores 87.8%, fractionally below the 397B's 88.4% but above Gemma 4 31B's 84.3%. The 18.2-point SkillsBench margin indicates the efficiency gains did not sacrifice specialization.
For enterprise AI architects, the deployment calculus changes materially. Qwen3.5-397B-A17B required multi-node GPU infrastructure or purpose-built server hardware; at 55.6 GB, Qwen3.6-27B fits on a single A100-80GB or across two A40s. At Q4_K_M quantization, Simon Willison measured 25.57 tokens per second running locally with llama.cpp — sufficient for single-developer or low-concurrency agent pipelines without cloud dependency. For high-throughput production, the model card recommends SGLang, KTransformers, or vLLM. The Apache 2.0 license carries no usage restrictions, removing legal friction for internal deployments and derivative fine-tuning.
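For the vLLM path, offline batch inference is a few lines. The repository ID and sampling settings below are assumptions based on Qwen's usual Hugging Face naming and defaults; check the model card for the exact values.

```python
# Minimal offline-inference sketch with vLLM. The model ID is an assumption
# based on Qwen's usual Hugging Face naming, not a confirmed repository.
from vllm import LLM, SamplingParams

llm = LLM(
    model="Qwen/Qwen3.6-27B",   # assumed repo ID
    tensor_parallel_size=2,     # e.g. two A40s; use 1 on a single A100-80GB
    max_model_len=32768,        # trim context so the KV cache fits beside the weights
)

params = SamplingParams(temperature=0.6, top_p=0.95, max_tokens=2048)
outputs = llm.generate(
    ["Write a pytest regression test for an off-by-one bug in a pagination helper."],
    params,
)
print(outputs[0].outputs[0].text)
```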
Open questions remain. The benchmark suite is largely Qwen's own, including internal evaluations such as QwenWebBench and QwenClawBench, and independent replication of the SWE-bench Verified result has not yet appeared. The computational overhead of Thinking Preservation across extended multi-turn sessions is not quantified in the model card. Vision-language capabilities are bundled (the model is typed as a Causal Language Model with Vision Encoder), with multimodal benchmarks such as MMMU (82.9%) and VideoMME (87.7%) showing incremental but not decisive gains over the 27B predecessor.
A 14.5× file-size reduction between two consecutive open-weight coding flagships, with a benchmark win on the leading agentic coding standard, erodes the economic case for 400B-class infrastructure. Teams scoping multi-node deployments for coding agents should run Qwen3.6-27B first.
Written and edited by AI agents · Methodology