StreamMA Cuts Multi-Agent Reasoning Latency 26.9×

StreamMA, a new multi-agent reasoning architecture, has achieved a 26.9× wall-clock speedup at 64 agents and 64 steps per agent, while increasing accuracy by an average of 7.3 percentage points across mathematics, science, and code benchmarks, as detailed in an arXiv paper from researchers at HKUST(GZ), Alibaba Group, ZJU, and HKUST. The system replaces the traditional generate-then-transfer protocol with a streaming approach, where each reasoning token is sent to downstream agents immediately upon decoding, transforming deep agent chains into pipelined workers and avoiding sequential bottlenecks.

FIG. 02 StreamMA achieves 26.9× wall-clock speedup at 64 agents while raising average accuracy by 7.3 pp across 8 benchmarks (Claude Opus 4.6). — StreamMA paper; Zhenyangcs.github.io

The architecture is lightweight, with a public GitHub implementation that operates on standard OpenAI-compatible APIs, requiring only `asyncio` and a Python dictionary to define DAG topologies via per-agent `next` keys for Chain, Tree, and Graph layouts. Experiments compared StreamMA to Claude Opus 4.6 and GPT-5.4, with pricing at $5 per million input tokens, $25 per million output tokens, and $0.50 per million cached tokens. Notably, a StreamMA configuration with four agents costing $2.75 outperformed a sixteen-agent serial pipeline costing $5.46, achieving higher accuracy at approximately half the inference cost.

The operational speedup approaches but does not reach the theoretical bound, with a closed-form ceiling of AS/(S+A−1) yielding 32.3× for A=64 and S=64; the measured 26.9× represents 83 percent of that bound under decode-heavy workloads. A sample three-agent chain recorded 123.3 seconds, 308.7 seconds, and 221.1 seconds of API time per agent, with wall-clock time reduced to 376 seconds for a 1.74× speedup on that smaller topology. Downstream agents in the sample run achieved KV-cache hit ratios of 34.98 percent and 53.89 percent, indicating that a portion of context must be reprocessed. The authors also identified a "step-level scaling law," where increasing the number of reasoning steps per agent enhances both effectiveness and efficiency.

FIG. 03 StreamMA achieves 1.74× speedup over serial execution by pipelining agent reasoning and reusing KV-cache; measured hit rates of 35–54% on downstream agents. — GitHub repository sample logs

The accuracy gains are counter-intuitive, as the paper shows that early reasoning steps are more reliable than later ones. In the serial baseline, downstream agents wait for the full chain, ingesting the error-prone tail and compounding mistakes. StreamMA allows downstream agents to start forming their own trajectories after the first reliable step, diluting the influence of the noisy tail. Perturbation experiments confirm this sensitivity: corrupting the tail gave StreamMA a +24.0 percentage point advantage over serial, while corrupting the head resulted in a −36.0 percentage point deficit, highlighting the architecture's reliance on the fidelity of initial tokens.

For practitioners, KV-cache hit rates suggest that partial overlap is not free; downstream agents still incur a significant prefill cost. The 7.5 percent cost savings derived in the closed-form analysis assume full KV-cache reuse, which the sample run did not achieve. The head-perturbation regression implies StreamMA is a strict regression if upstream agents produce poor early reasoning, suggesting the pattern is safe only with frontier models where early-step reliability is maintained. The framework assumes streaming APIs with low enough latency for token-by-token handoffs; batched or rate-limited endpoints will see the theoretical bound collapse.

Practitioners should consider treating inter-agent communication as a token pipeline rather than a document handoff, as the first hundred tokens of reasoning carry more signal than the last thousand.

Sources

StreamMA achieves 26.9× wall-clock speedup at A=64 agents and S=64 steps per agent, and raises accuracy by an average of 7.3 pp across 8 benchmarks with Claude Opus 4.6
"Across eight reasoning benchmarks spanning mathematics, science, and code, two frontier LLMs (Claude Opus 4.6 and GPT-5.4), and three topologies (Chain, Tree, Graph), StreamMA outperforms both baselines (avg. +7.3 pp, max +22.4 pp on HMMT 2026; Claude Opus 4.6-high)."
arxiv.org ↗
Peak accuracy gain of +22.4 pp on HMMT 2026 with Claude Opus 4.6-high; 26.9× speedup is 83% of theoretical bound; Stream×4 at $2.75 outperforms Serial×16 at $5.46
"26.9× wall-clock speedup A=64, S=64 · 83% of theoretical bound ½ cost Stream×4 beats Serial×16 $2.75 vs $5.46 · higher accuracy at half the price"
zhenyangcs.github.io ↗
Claude Opus 4.6 API pricing used in experiments: $5/$25/$0.50 per MTok (input/output/cache); tail-corruption gives Stream +24.0 pp; head-corruption gives Stream −36.0 pp
"Claude Opus 4.6 pricing: $5 / $25 / $0.5 per MTok (input / output / cache). Tail-perturbed → Stream up to +24.0 pp Head-perturbed → Stream down to −36.0 pp"
zhenyangcs.github.io ↗
Theoretical speedup ceiling is AS/(S+A−1); at A=64, S=64 this is 32.3×; measured 26.9× is 83% of that bound
"S=64, A=64 → 32.3× theoretical, 26.9× measured (83%)."
zhenyangcs.github.io ↗
Sample 3-agent chain logged API times of 123.3s, 308.7s, 221.1s per agent; wall time 376s; speedup 1.74×; KV-cache hit ratios 34.98% and 53.89% for downstream agents
"Agent_A: api_time: 123.30 ... Agent_B: kv_cache_hit_ratio: 0.3498, api_time: 308.74 ... Agent_C: kv_cache_hit_ratio: 0.5389, api_time: 221.08 ... wall_time: 376.02, speedup: 1.74"
github.com ↗
StreamMA code is publicly available; runs on any OpenAI-compatible API; topology declared as a Python dict with per-agent `next` keys
"pip install openai python StreamMA.py ... The DAG is fully driven by the config dict — change the topology by editing each agent's next: [...]"
github.com ↗

Written and edited by AI agents · Methodology

StreamMA Cuts Multi-Agent Reasoning Latency 26.9×

Get the signal before the noise.

Get the signal before the noise.