Alibaba researchers have published LongSeeker, a long-horizon search agent that scores 61.5% on BrowseComp and 62.5% on BrowseComp-ZH. The margins over competing agents are wide: Tongyi DeepResearch reaches 43.2% on English and 46.7% on Chinese; AgentFold scores 36.2% and 47.3%. LongSeeker's roughly 18-point lead over Tongyi DeepResearch on the English benchmark points to a primary bottleneck: context management, not raw model scale, constrains agent performance on multi-step tasks.

The core advance is Context-ReAct, a framework that treats working memory as an active resource to be managed rather than a passive log. Agents tackling multi-step search (web research loops, document synthesis, tool chaining) accumulate intermediate observations, tool outputs, and reasoning chains. Naively appending every turn bloats the context window, raising per-inference cost and hallucination risk as the model attends to increasingly diffuse, stale information.

FIG. 02 LongSeeker outperforms competitors on both BrowseComp benchmarks, with 61.5% on English and 62.5% on Chinese tests. — Alibaba LongSeeker arXiv:2605.05191

Context-ReAct introduces five operators: Skip omits low-relevance content from the live window. Compress summarizes resolved branches into compact form. Rollback reinstates earlier trajectory states when a reasoning path fails. Snippet extracts targeted evidence fragments from verbose tool output. Delete discards unhelpful branches entirely. The authors prove that Compress alone is expressive enough to represent any context-management policy; the specialized operators exist to deliver efficiency and fidelity guarantees that Compress by itself would not.
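The paper ships no reference implementation. As a minimal sketch, assuming the operators act as edits on an ordered window with named checkpoints (the Op, ContextWindow, and apply names below are illustrative, not the authors' API), the vocabulary can be modeled like this:

```python
from dataclasses import dataclass, field
from enum import Enum, auto


class Op(Enum):
    SKIP = auto()      # never admit low-relevance content into the window
    COMPRESS = auto()  # replace a resolved branch with a compact summary
    ROLLBACK = auto()  # reinstate an earlier checkpointed state
    SNIPPET = auto()   # keep only a targeted fragment of verbose tool output
    DELETE = auto()    # drop an unhelpful branch entirely


@dataclass
class ContextWindow:
    """Live window as an ordered list of entries, plus checkpoints for Rollback."""
    entries: list[str] = field(default_factory=list)
    checkpoints: dict[str, list[str]] = field(default_factory=dict)

    def checkpoint(self, tag: str) -> None:
        self.checkpoints[tag] = list(self.entries)

    def apply(self, op: Op, **kw) -> None:
        if op is Op.SKIP:
            return                                # content never enters the window
        if op is Op.COMPRESS:
            i, j = kw["span"]
            self.entries[i:j] = [kw["summary"]]   # resolved branch becomes a summary
        elif op is Op.ROLLBACK:
            self.entries = list(self.checkpoints[kw["tag"]])
        elif op is Op.SNIPPET:
            self.entries.append(kw["fragment"])   # evidence fragment only
        elif op is Op.DELETE:
            i, j = kw["span"]
            del self.entries[i:j]                 # branch is gone from the window
```

In this framing the universality claim is intuitive: a Skip, Snippet, or Delete outcome can be reproduced by a Compress whose summary is empty or a verbatim fragment, so the specialized operators buy efficiency (no summarization call) rather than expressiveness.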

LongSeeker instantiates Context-ReAct at scale. The model is fine-tuned from Qwen3-30B-A3B, Alibaba's 30-billion-parameter mixture-of-experts model with 3 billion active parameters, on 10,000 synthesized trajectories covering the full five-operator vocabulary across realistic search tasks.
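The paper does not publish the trajectory schema. Purely as an illustration of what a synthesized training example might contain, a record could interleave reasoning, tool calls, and operator decisions along these lines (every field name and value here is hypothetical):

```python
# Hypothetical shape of one synthesized training trajectory; the paper
# does not publish its schema, so every field below is illustrative.
example_trajectory = {
    "task": "Find the 2019 acquisition price of <company> and the buyer's CEO at the time.",
    "steps": [
        {"thought": "Search for the acquisition announcement.",
         "action": {"tool": "web_search", "query": "<company> 2019 acquisition"}},
        {"observation": "<~2,000 tokens of raw search results>",
         "context_op": {"op": "snippet", "keep": "Acquired for <price> in March 2019."}},
        {"thought": "This branch chased the wrong entity; abandon it.",
         "context_op": {"op": "rollback", "to_checkpoint": "after_step_1"}},
        {"thought": "The price sub-question is resolved; shrink it.",
         "context_op": {"op": "compress", "span": [0, 4],
                        "summary": "Price confirmed: <price>, deal closed 2019-03."}},
    ],
    "final_answer": "<price>; CEO at close: <name>",
}
```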

For enterprise AI architects, the implications are immediate. Teams running agentic pipelines today handle context pressure with either fixed truncation windows that silently drop information or full-context replay that compounds latency on every step. Context-ReAct offers a third path with formal guarantees: an agent that knows when to compress, rewind, and excise. That maps directly to cost-per-task reduction in production loops, particularly in legal document review, competitive intelligence research, and multi-hop knowledge retrieval, where task horizons exceed 100 steps.
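To see why this maps to cost, consider a rough token-accounting sketch. All constants below are illustrative assumptions, not measurements from the paper:

```python
# Back-of-the-envelope prompt-token comparison for a 100-step search loop.
# Assumptions (illustrative only): each step emits a 2,000-token observation,
# Compress shrinks a resolved branch to a 150-token summary, and roughly 80%
# of past steps are resolved at any point in the run.
STEPS, OBS_TOKENS, SUMMARY_TOKENS, RESOLVED_FRAC = 100, 2_000, 150, 0.8

# Naive append: every step re-attends the full accumulated log.
naive = sum(step * OBS_TOKENS for step in range(1, STEPS + 1))

# Operator-managed: resolved branches sit in the window as summaries.
managed = 0
for step in range(1, STEPS + 1):
    resolved = int(step * RESOLVED_FRAC)
    live = step - resolved
    managed += resolved * SUMMARY_TOKENS + live * OBS_TOKENS

print(f"naive prompt tokens over the run:   {naive:,}")    # ~10.1M
print(f"managed prompt tokens over the run: {managed:,}")  # ~2.7M
```

Under these assumptions the managed loop attends to roughly a quarter of the tokens, and the gap widens as horizons grow, since the naive term scales quadratically with step count.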

The operator framework also bears on compliance. Rollback gives the agent an auditable mechanism to abandon a reasoning branch and re-anchor to a verified state—relevant in regulated industries where agent decision paths must be inspectable. Delete raises a mirror concern: discarding trajectory content means deleted branches may not be recoverable for post-hoc audit.
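One way to square Delete with auditability, sketched here as an assumption rather than anything the paper specifies, is to mirror destructive edits to an append-only log that never re-enters the model's context (class and method names are hypothetical):

```python
import time
from dataclasses import dataclass, field


@dataclass
class AuditedWindow:
    """Live window whose destructive edits are mirrored to an append-only
    audit log kept outside the model's context."""
    entries: list[str] = field(default_factory=list)
    audit_log: list[dict] = field(default_factory=list)

    def _record(self, op: str, **detail) -> None:
        self.audit_log.append({"ts": time.time(), "op": op, **detail})

    def delete_branch(self, i: int, j: int, reason: str) -> None:
        removed = self.entries[i:j]
        del self.entries[i:j]                       # gone from the live window ...
        self._record("delete", reason=reason, content=removed)  # ... kept in the log

    def rollback(self, checkpoint: list[str], tag: str) -> None:
        self._record("rollback", to=tag, pre_state=list(self.entries))
        self.entries = list(checkpoint)             # re-anchor to the verified state
```

The trade-off is storage rather than context: the model's window stays small while the decision path remains reconstructable for post-hoc review.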

Open questions remain about the 10,000-trajectory synthesis methodology. The paper does not detail how trajectory quality was validated or whether the synthesis pipeline introduces distributional biases. The compute overhead of the Compress operator at scale, and the latency cost of Rollback decisions, will matter in latency-sensitive deployments. The Qwen3-30B-A3B backbone is efficient as a mixture-of-experts model but is not trivially self-hostable for organizations with strict data-residency requirements.

BrowseComp tests open-web search comprehensiveness. Production retrieval environments typically constrain both the tool surface and the document corpus, so treat these benchmark results as directional, not definitive. What the numbers validate: a 30-billion-parameter model with well-designed context orchestration beats larger or more expensively deployed alternatives. The next test is whether Context-ReAct transfers outside the benchmark lab into the narrow, high-stakes retrieval loops that determine whether enterprise agents reach production.
