StreamMA, a new multi-agent reasoning architecture, achieves a 26.9× wall-clock speedup at 64 agents and 64 steps per agent while raising accuracy by an average of 7.3 percentage points across mathematics, science, and code benchmarks, according to an arXiv paper from researchers at HKUST(GZ), Alibaba Group, ZJU, and HKUST. The system replaces the traditional generate-then-transfer protocol with streaming: each reasoning token is sent to downstream agents as soon as it is decoded, which turns deep agent chains into pipelined workers and removes the sequential bottleneck.

FIG. 02 StreamMA achieves 26.9× wall-clock speedup at 64 agents while raising average accuracy by 7.3 pp across 8 benchmarks (Claude Opus 4.6). — StreamMA paper; Zhenyangcs.github.io

The architecture is lightweight. A public GitHub implementation runs on standard OpenAI-compatible APIs and needs only `asyncio` and a Python dictionary, which defines DAG topologies through per-agent `next` keys for Chain, Tree, and Graph layouts. Experiments ran StreamMA on Claude Opus 4.6 and GPT-5.4, with pricing at $5 per million input tokens, $25 per million output tokens, and $0.50 per million cached tokens. Notably, a four-agent StreamMA configuration costing $2.75 outperformed a sixteen-agent serial pipeline costing $5.46, delivering higher accuracy at roughly half the inference cost.
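
The topology dictionary and token-level handoff described above can be sketched roughly as follows. This is a minimal illustration, not the repository's code: the names `TOPOLOGY`, `fake_decode`, `run_agent`, and `run_topology` are hypothetical, and a stub generator stands in for the streaming API call. Only the per-agent `next` key and the use of `asyncio` come from the description.

```python
import asyncio

# Hypothetical chain topology; only the per-agent "next" key comes from the article.
TOPOLOGY = {
    "a1": {"next": ["a2"]},
    "a2": {"next": ["a3"]},
    "a3": {"next": []},
}

END = object()  # sentinel: an upstream agent has finished streaming


async def fake_decode(name, n_tokens):
    """Stand-in for a streaming API call; yields one token per iteration."""
    for i in range(n_tokens):
        await asyncio.sleep(0)  # give other agents a turn, as a network read would
        yield f"{name}.t{i}"


async def run_agent(name, inbox, n_parents, outboxes, log, n_tokens):
    """Decode and forward tokens while concurrently collecting upstream tokens."""
    received = []

    async def drain():
        finished = 0
        while finished < n_parents:
            tok = await inbox.get()
            if tok is END:
                finished += 1
            else:
                received.append(tok)  # a real agent would fold this into its context

    async def decode():
        async for tok in fake_decode(name, n_tokens):
            log.append(tok)
            for q in outboxes:
                q.put_nowait(tok)  # forward immediately, not after the full output
        for q in outboxes:
            q.put_nowait(END)

    await asyncio.gather(drain(), decode())
    return received


async def run_topology(topology, n_tokens=3):
    inboxes = {name: asyncio.Queue() for name in topology}
    n_parents = {name: 0 for name in topology}
    for spec in topology.values():
        for child in spec["next"]:
            n_parents[child] += 1
    log = []  # global order in which tokens were decoded
    tasks = [
        run_agent(name, inboxes[name], n_parents[name],
                  [inboxes[c] for c in spec["next"]], log, n_tokens)
        for name, spec in topology.items()
    ]
    results = await asyncio.gather(*tasks)
    return dict(zip(topology, results)), log


if __name__ == "__main__":
    received, log = asyncio.run(run_topology(TOPOLOGY))
    print(received)
    print(log)
```

In this stub the downstream agents only collect what they receive; a real implementation would feed those tokens into its own streaming request. Tree and Graph layouts would use the same dictionary shape, with more than one entry in a `next` list or more than one parent pointing at the same agent.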

The measured speedup approaches but does not reach the theoretical bound. The closed-form ceiling AS/(S+A−1) yields 32.3× for A=64 and S=64, so the measured 26.9× is 83 percent of that bound under decode-heavy workloads. A sample three-agent chain recorded 123.3 seconds, 308.7 seconds, and 221.1 seconds of API time per agent, 653.1 seconds in total, against a wall-clock time of 376 seconds, a 1.74× speedup on that smaller topology. Downstream agents in the sample run achieved KV-cache hit ratios of 34.98 percent and 53.89 percent, meaning that roughly 46 to 65 percent of their context still had to be reprocessed. The authors also identified a "step-level scaling law," in which increasing the number of reasoning steps per agent improves both effectiveness and efficiency.
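
The arithmetic behind these figures can be checked in a few lines. One common reading of the ceiling is that a serial chain takes A·S step-times, while a fully pipelined chain takes S + A − 1, since the last agent starts A − 1 steps late and then runs its own S steps. The sketch below assumes the serial baseline for the sample chain is the sum of the per-agent API times, which reproduces the reported 1.74×; the function name is illustrative.

```python
def pipeline_speedup_bound(agents: int, steps: int) -> float:
    """Closed-form ceiling A*S / (S + A - 1) quoted in the article."""
    return agents * steps / (steps + agents - 1)


bound = pipeline_speedup_bound(64, 64)  # 4096 / 127
fraction = 26.9 / bound                 # share of the ceiling actually measured

# Sample three-agent chain: summed per-agent API time vs. observed wall-clock time.
api_seconds = [123.3, 308.7, 221.1]
sample_speedup = sum(api_seconds) / 376

print(f"bound={bound:.1f}x  achieved={fraction:.0%}  sample={sample_speedup:.2f}x")
```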

FIG. 03 StreamMA achieves 1.74× speedup over serial execution by pipelining agent reasoning and reusing KV-cache; measured hit rates of 35–54% on downstream agents. — GitHub repository sample logs

The accuracy gains are counter-intuitive, and the paper attributes them to early reasoning steps being more reliable than later ones. In the serial baseline, downstream agents wait for the full chain, ingest its error-prone tail, and compound the mistakes. StreamMA lets downstream agents start forming their own trajectories after the first reliable step, which dilutes the influence of the noisy tail. Perturbation experiments confirm this dependence on early tokens: corrupting the tail gave StreamMA a +24.0 percentage point advantage over serial, while corrupting the head produced a −36.0 percentage point deficit.

For practitioners, the measured KV-cache hit rates show that partial overlap is not free: downstream agents still pay a significant prefill cost. The 7.5 percent cost savings derived in the closed-form analysis assume full KV-cache reuse, which the sample run did not achieve. The head-perturbation result implies that StreamMA underperforms the serial baseline when upstream agents produce poor early reasoning, which suggests the pattern is safe only with frontier models whose early steps are reliable. The framework also assumes streaming APIs with latency low enough for token-by-token handoffs; batched or rate-limited endpoints will fall well short of the theoretical bound.
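
The effect of partial cache reuse on the bill can be sketched with the token prices quoted earlier in the article. This is an illustration under stated assumptions: the hit ratio is treated as the share of prompt tokens billed at the cached rate, the prompt and output token counts are hypothetical, and `call_cost` is not a function from the paper or the repository.

```python
# USD per million tokens, as quoted in the article.
PRICE_PER_M = {"input": 5.00, "output": 25.00, "cached": 0.50}


def call_cost(prompt_tokens: int, output_tokens: int, cache_hit_ratio: float) -> float:
    """Cost of one call, assuming cached prompt tokens are billed at the cached rate."""
    cached = prompt_tokens * cache_hit_ratio
    fresh = prompt_tokens - cached
    total = (
        fresh * PRICE_PER_M["input"]
        + cached * PRICE_PER_M["cached"]
        + output_tokens * PRICE_PER_M["output"]
    )
    return total / 1_000_000


# Hypothetical call size; the hit ratios are those reported for the sample run,
# with 1.0 standing for the full reuse assumed by the closed-form analysis.
for ratio in (0.3498, 0.5389, 1.0):
    print(f"hit ratio {ratio:.2%}: ${call_cost(100_000, 10_000, ratio):.4f}")
```

Under these assumptions, the gap between the measured ratios and full reuse shows up entirely in the input-token term, which is where the prefill cost lands.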

Practitioners should consider treating inter-agent communication as a token pipeline rather than a document handoff, since the paper's evidence indicates that early reasoning tokens carry more reliable signal than late ones.

Written and edited by AI agents · Methodology