A team from the University of Central Florida, Westlake University, Snap Inc., UT-Austin, and Tencent has proposed replacing natural-language message passing in multi-agent LLM systems with direct weight perturbations. Their method, TFlow (Thought Flow), cuts total processed tokens by up to 83.27% relative to a standard text-based three-agent baseline and shrinks wall-clock inference time by up to 4.6×, while matching the baseline's accuracy on four of the five evaluated benchmarks.
In conventional multi-agent pipelines, each sender encodes its understanding into natural language, ships it as tokens, and the receiver re-encodes those tokens at prefill. The KV cache grows with every agent message, and prefill overhead compounds across reasoning chains. TFlow eliminates the message entirely. Each sender (a frozen, role-prompted Qwen3-4B) processes the query once and exposes its hidden states to a shared learned parameter generator. That generator maps the activations into layer-specific low-rank LoRA factors targeting the receiver's linear modules.
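The paper summarized here does not spell out the generator's internals, but the idea can be sketched in a few lines of PyTorch. The class name `HiddenToLoRA`, the mean-pooling step, and all dimensions below are illustrative assumptions, not the authors' architecture:

```python
import torch
import torch.nn as nn

class HiddenToLoRA(nn.Module):
    """Hypothetical parameter generator: maps a sender's hidden states
    to low-rank LoRA factors (A, B) for one target linear layer in the
    receiver. Shapes and pooling are assumptions, not the paper's design."""

    def __init__(self, hidden_dim: int, in_features: int,
                 out_features: int, rank: int = 8):
        super().__init__()
        self.rank = rank
        self.in_features = in_features
        self.out_features = out_features
        # One projection head per LoRA factor; in TFlow the generator
        # is shared across senders.
        self.to_a = nn.Linear(hidden_dim, rank * in_features)
        self.to_b = nn.Linear(hidden_dim, out_features * rank)

    def forward(self, sender_hidden: torch.Tensor):
        # sender_hidden: (seq_len, hidden_dim) activations from the
        # frozen sender; pool to one query-conditioned vector.
        pooled = sender_hidden.mean(dim=0)
        a = self.to_a(pooled).view(self.rank, self.in_features)
        b = self.to_b(pooled).view(self.out_features, self.rank)
        return a, b  # delta_W = b @ a for one receiver linear module
```

One such (A, B) pair would be emitted per targeted receiver layer, conditioned on the current query rather than fixed at training time.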
The LoRA deltas from multiple senders are fused via a lightweight scalar gate and transiently patched into the frozen receiver's forward pass only during generation. After the answer is produced, the patch is discarded and the base model is restored — no persistent adapter state, no permanent weight changes, no receiver context inflation. The receiver sees only the original query text, with its parameters quietly modulated by what the senders computed.
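The fuse-patch-restore cycle can be sketched as a context manager. The gating here is a plain scalar weight per sender, matching the paper's description of a "lightweight scalar gate" at the level of detail given; the function name and scaling are assumptions:

```python
import contextlib
import torch

@contextlib.contextmanager
def transient_lora_patch(linear: torch.nn.Linear, deltas, gates,
                         scale: float = 1.0):
    """Fuse per-sender LoRA deltas with scalar gates and patch them into
    a frozen linear layer for one generation call only. `deltas` is a
    list of (a, b) factor pairs, `gates` a list of scalars. Illustrative
    sketch, not the paper's implementation."""
    fused = sum(g * (b @ a) for g, (a, b) in zip(gates, deltas))
    original = linear.weight.data.clone()
    linear.weight.data += scale * fused
    try:
        yield linear  # call receiver.generate(...) inside this block
    finally:
        # Restore the base model: no persistent adapter state remains.
        linear.weight.data.copy_(original)
```

Inside the `with` block the receiver consumes only the raw query tokens; the senders' influence arrives purely through the perturbed weights, and the restore step guarantees the next query starts from the unmodified base model.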
Across five benchmarks — GSM8K, MATH, MBPP+, HumanEval+, and one additional task — TFlow with three Qwen3-4B agents improves accuracy by up to 8.5 points over a single-agent baseline while reducing processed tokens by up to 32.69%. Against the text-based TextMAS baseline with three agents communicating via natural language, the gains steepen: 83.27% fewer total tokens processed and a 4.6× wall-clock speedup. On GSM8K specifically, token consumption drops 76.7% versus TextMAS with competitive accuracy. Compared to a static LoRA control — a fixed adapter with no query conditioning — TFlow delivers an average accuracy gain of 4.29 points, with the largest margins on MBPP+ and HumanEval+.
TFlow requires the receiver architecture and target modules to be known at training time, and the parameter generator is receiver-specific. Mixing model families (say, a Mistral sender targeting a Llama receiver) is unsupported by the current framework. TFlow matches text-based accuracy on four of five benchmarks; on the fifth, the token savings come at an accuracy cost. The paper (arXiv:2605.13839v1, May 13, 2026) reports no production deployment numbers: no $/1M tokens, no serving GPU type, no p99 latency breakdown. The parameter generator's own inference cost is not quantified separately, leaving the full per-query overhead unclear. Nor are there multi-round conversation experiments, which matter for agentic settings where agents exchange multiple messages per task.
If your pipeline centers on a fixed receiver, where the same backbone always produces the final output, query-conditioned LoRA injection from parallel frozen senders is a viable alternative to token passing. The token and latency reductions are substantial, but you must train a receiver-specific generator and commit to a homogeneous model family.