A team from the University of Central Florida, Westlake University, Snap Inc., UT-Austin, and Tencent has proposed replacing natural-language message passing in multi-agent LLM systems with direct weight perturbations. Their method, TFlow (Thought Flow), cuts total processed tokens by up to 83.27% relative to a standard text-based three-agent baseline and shrinks wall-clock inference time by up to 4.6×, while matching the baseline's accuracy on four of the five evaluated benchmarks.
In conventional multi-agent pipelines, each sender encodes its understanding into natural language, ships it as tokens, and the receiver re-encodes those tokens at prefill. The KV cache grows with every agent message, and prefill overhead compounds across reasoning chains. TFlow eliminates the message entirely. Each sender (a frozen, role-prompted Qwen3-4B) processes the query once and exposes its hidden states to a shared learned parameter generator. That generator maps the activations into layer-specific low-rank LoRA factors targeting the receiver's linear modules.
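The paper summarized here does not spell out the generator's internals, but the idea can be sketched in a few lines of PyTorch. The class name `HiddenToLoRA`, the mean-pooling step, and all dimensions below are illustrative assumptions, not the authors' architecture:

```python
import torch
import torch.nn as nn

class HiddenToLoRA(nn.Module):
    """Hypothetical parameter generator: maps a sender's hidden states
    to low-rank LoRA factors (A, B) for one target linear layer in the
    receiver. Shapes and pooling are assumptions, not the paper's design."""

    def __init__(self, hidden_dim: int, in_features: int,
                 out_features: int, rank: int = 8):
        super().__init__()
        self.rank = rank
        self.in_features = in_features
        self.out_features = out_features
        # One projection head per LoRA factor; in TFlow the generator
        # is shared across senders.
        self.to_a = nn.Linear(hidden_dim, rank * in_features)
        self.to_b = nn.Linear(hidden_dim, out_features * rank)

    def forward(self, sender_hidden: torch.Tensor):
        # sender_hidden: (seq_len, hidden_dim) activations from the
        # frozen sender; pool to one query-conditioned vector.
        pooled = sender_hidden.mean(dim=0)
        a = self.to_a(pooled).view(self.rank, self.in_features)
        b = self.to_b(pooled).view(self.out_features, self.rank)
        return a, b  # delta_W = b @ a for one receiver linear module
```

One such (A, B) pair would be emitted per targeted receiver layer, conditioned on the current query rather than fixed at training time.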
The LoRA deltas from multiple senders are fused via a lightweight scalar gate and transiently patched into the frozen receiver's forward pass only during generation. After the answer is produced, the patch is discarded and the base model is restored — no persistent adapter state, no permanent weight changes, no receiver context inflation. The receiver sees only the original query text, with its parameters quietly modulated by what the senders computed.
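The fuse-patch-restore cycle can be sketched as a context manager. The gating here is a plain scalar weight per sender, matching the paper's description of a "lightweight scalar gate" at the level of detail given; the function name and scaling are assumptions:

```python
import contextlib
import torch

@contextlib.contextmanager
def transient_lora_patch(linear: torch.nn.Linear, deltas, gates,
                         scale: float = 1.0):
    """Fuse per-sender LoRA deltas with scalar gates and patch them into
    a frozen linear layer for one generation call only. `deltas` is a
    list of (a, b) factor pairs, `gates` a list of scalars. Illustrative
    sketch, not the paper's implementation."""
    fused = sum(g * (b @ a) for g, (a, b) in zip(gates, deltas))
    original = linear.weight.data.clone()
    linear.weight.data += scale * fused
    try:
        yield linear  # call receiver.generate(...) inside this block
    finally:
        # Restore the base model: no persistent adapter state remains.
        linear.weight.data.copy_(original)
```

Inside the `with` block the receiver consumes only the raw query tokens; the senders' influence arrives purely through the perturbed weights, and the restore step guarantees the next query starts from the unmodified base model.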
Across five benchmarks — GSM8K, MATH, MBPP+, HumanEval+, and one additional task — TFlow with three Qwen3-4B agents improves accuracy by up to 8.5 points over a single-agent baseline while reducing processed tokens by up to 32.69%. Against the text-based TextMAS baseline with three agents communicating via natural language, the gains steepen: 83.27% fewer total tokens processed and a 4.6× wall-clock speedup. On GSM8K specifically, token consumption drops 76.7% versus TextMAS with competitive accuracy. Compared to a static LoRA control — a fixed adapter with no query conditioning — TFlow delivers an average accuracy gain of 4.29 points, with the largest margins on MBPP+ and HumanEval+.
TFlow requires the receiver architecture and target modules to be known at training time, and the parameter generator is receiver-specific. Mixing model families (say, a Mistral sender targeting a Llama receiver) is unsupported by the current framework. TFlow matches text-based accuracy on four of five benchmarks; on the fifth, the token savings come at an accuracy cost. The paper (arXiv:2605.13839v1, May 13, 2026) reports no production deployment numbers: no $/1M tokens, no serving GPU type, no p99 latency breakdown. The parameter generator's own inference cost is not quantified separately, leaving the full per-query overhead unclear. Nor are there multi-round conversation experiments, which matter for agentic settings where agents exchange multiple messages per task.
If your pipeline centers on a fixed receiver, where the same backbone always produces the final output, query-conditioned LoRA injection from parallel frozen senders is a viable alternative to token passing. The token and latency reductions are substantial, but you must train a receiver-specific generator and commit to a homogeneous model family.