Researchers from AMD and collaborating institutions have published HyLo (HYbrid LOng-context), a post-training recipe that converts pretrained Transformer checkpoints into hybrid architectures handling context windows up to 32 times longer than the originals, without retraining from scratch.
The problem HyLo addresses is structural: standard Transformer attention scales quadratically with sequence length, making multi-hundred-thousand-token contexts memory-prohibitive in production. The dominant workaround — retraining a new model — is cost-prohibitive for most enterprises, which hold significant investments in fine-tuned Transformer checkpoints. HyLo's upcycling approach treats those checkpoints as the starting point rather than a sunk cost.
The technique combines three components. First, selected attention layers are replaced with linear sequence-modeling blocks — either Mamba2 or Gated DeltaNet — while remaining layers are converted to Multi-Head Latent Attention (MLA), the low-rank KV projection architecture popularized by DeepSeek. Second, the model undergoes staged long-context training that progressively extends the sequence length. Third, teacher-guided distillation stabilizes the optimization, preventing the architectural surgery from degrading short-context performance. The resulting hybrid model retains the original model's capability profile on standard benchmarks while acquiring long-context competence the base Transformer never had.
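The shape of that recipe is easier to see in code. The sketch below is illustrative only: the 1-in-3 replacement ratio, the module names, the loss weighting, and the stage boundaries are assumptions, and real Mamba2 / Gated DeltaNet and MLA blocks are far more involved than the factory callables hinted at here.

```python
import torch.nn.functional as F

def upcycle(model, make_linear_block, make_mla_block, linear_every=3):
    """Replace attention sublayers in an existing checkpoint: every
    `linear_every`-th layer gets a linear sequence-modeling block
    (Mamba2 / Gated DeltaNet style); the rest become MLA-style
    low-rank attention initialized from the original weights."""
    for i, layer in enumerate(model.layers):
        if i % linear_every == 0:
            layer.attn = make_linear_block(layer.attn)
        else:
            layer.attn = make_mla_block(layer.attn)
    return model

def distill_loss(student_logits, teacher_logits, labels, alpha=0.5, tau=1.0):
    """Teacher-guided objective: next-token cross-entropy plus a KL term
    toward the frozen original Transformer, which keeps short-context
    behavior from drifting during the architectural surgery."""
    ce = F.cross_entropy(
        student_logits[:, :-1].reshape(-1, student_logits.size(-1)),
        labels[:, 1:].reshape(-1),
    )
    kl = F.kl_div(
        F.log_softmax(student_logits / tau, dim=-1),
        F.softmax(teacher_logits / tau, dim=-1),
        reduction="batchmean",
    )
    return alpha * ce + (1.0 - alpha) * kl

# Staged long-context training: progressively extend the sequence length
# (the stage boundaries below are placeholders, not the published schedule).
stages = [(4_096, 0.6), (32_768, 0.3), (131_072, 0.1)]  # (seq_len, fraction of tokens)
```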
The infrastructure numbers matter most to AI platform teams. HyLo reduces KV-cache memory by more than 90% compared to standard Transformer attention, and in the team's vLLM inference stack, HyLo models handle 2-million-token prefill and decoding. Comparable Llama baselines run out of memory at 64K context, making the effective context headroom roughly 31 times greater at the hardware level. On the RULER long-context evaluation benchmark, HyLo consistently outperforms state-of-the-art upcycled hybrid baselines at both 1B and 3B parameter scales, tested against Llama- and Qwen-based variants.
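A back-of-envelope check shows why figures of that magnitude are plausible. The dimensions below are illustrative values for a model in the 3B class, and the MLA latent width is an assumption; only the shape of the comparison matters.

```python
def kv_cache_gib(tokens: int, n_layers: int, floats_per_token_per_layer: int,
                 bytes_per_float: int = 2) -> float:
    """Size of the decode-time cache in GiB (fp16/bf16 by default)."""
    return tokens * n_layers * floats_per_token_per_layer * bytes_per_float / 2**30

# Standard attention: keys + values for 8 KV heads of width 128, across 28 layers.
standard = kv_cache_gib(2_000_000, 28, 2 * 8 * 128)
# MLA-style cache: one compressed latent per layer (width of 192 assumed here).
latent = kv_cache_gib(2_000_000, 28, 192)

print(f"standard ~{standard:.0f} GiB, latent ~{latent:.0f} GiB, "
      f"reduction {1 - latent / standard:.0%}")
# -> roughly 214 GiB vs 20 GiB at a 2M-token context, a ~91% reduction
```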
The training efficiency data reinforces the case. HyLo-Qwen-1.7B, trained on 10 billion tokens after upcycling, outperforms JetNemotron — an Nvidia hybrid baseline trained on 400 billion tokens — on GSM8K mathematical reasoning, LM-Harness commonsense reasoning, and RULER-64K long-context evaluations. That is a 40× token-budget advantage for comparable or superior task performance. For enterprises calculating the cost of extending deployed models' context capability, the compute arbitrage is concrete.
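As a rough compute translation of that token gap, using the common ~6 · params · tokens training-FLOPs approximation (a rule of thumb, not a figure from the paper) and holding model size fixed:

```python
def approx_train_flops(params: float, tokens: float) -> float:
    # Standard back-of-envelope: ~6 FLOPs per parameter per training token.
    return 6 * params * tokens

hylo_10b  = approx_train_flops(1.7e9, 10e9)    # HyLo-Qwen-1.7B upcycling budget
same_400b = approx_train_flops(1.7e9, 400e9)   # same-size model on a 400B-token budget
print(f"~{same_400b / hylo_10b:.0f}x more training compute for the 400B-token budget")
```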
For AI architects, the standard trade-off between context length and retraining cost now has a third option. Any team standardized on a Transformer-based foundation model — Llama, Qwen, or similar — can evaluate HyLo as a migration path to hybrid architecture without discarding existing fine-tune work. The vLLM integration path means the inference stack change is incremental, not a platform replacement. KV-cache savings of this magnitude also directly affect GPU memory allocation planning: workloads currently requiring dedicated high-memory instances (A100 80GB, H100) to maintain long-session state can shift to smaller footprints.
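In practice, an evaluation run could look like an ordinary vLLM offline job. The checkpoint id and the context and memory settings below are placeholders, and a hybrid architecture only loads if the installed vLLM build ships the corresponding model support.

```python
from vllm import LLM, SamplingParams

# Placeholder checkpoint id and settings; set max_model_len to the
# context window you actually intend to serve.
llm = LLM(
    model="org/hylo-qwen-1.7b",
    max_model_len=262_144,
    gpu_memory_utilization=0.90,
)

prompt = "<long multi-document context>\n\nAnswer using only the material above: ..."
outputs = llm.generate([prompt], SamplingParams(temperature=0.0, max_tokens=256))
print(outputs[0].outputs[0].text)
```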
The caveats are real. The published results top out at 3B parameters; whether the distillation stability holds at 7B, 13B, or 70B scales is unverified. The RULER benchmark, while standard for long-context evaluation, does not fully capture production retrieval tasks like multi-document reasoning over heterogeneous corpora. The Mamba2 and Gated DeltaNet blocks also introduce new kernel dependencies that may conflict with existing custom CUDA or Triton work in hardened inference pipelines.
The paper covers 1B-to-3B scale, and scaling laws for hybrid upcycling remain an open research question. But the 10B-token training budget for competitive performance is a hard data point: teams waiting on long-context hybrid models to mature enough for production evaluation no longer have that excuse.