OpenSeeker-v2, a 30B open-source search agent built by a research team, outperforms Tongyi DeepResearch — Alibaba's proprietary model trained with full continual pre-training, supervised fine-tuning, and reinforcement learning — across all four agentic search benchmarks, using only supervised fine-tuning on 10,600 curated examples.

OpenSeeker-v2 scores 46.0% on BrowseComp, 58.1% on BrowseComp-ZH, 34.6% on Humanity's Last Exam, and 78.0% on xbench, versus Tongyi's 43.4%, 46.7%, 32.9%, and 75.0%. The margin holds across English and Chinese retrieval tasks and a factual-reasoning benchmark notorious for demanding genuine reasoning chains.

FIG. 02 OpenSeeker-v2 benchmark scores vs Tongyi DeepResearch. OpenSeeker-v2 surpasses Tongyi on BrowseComp (46.0% vs 43.4%) and scores highest on composite reasoning tests. (Source: OpenSeeker-v2, arXiv 2605.04036)

The team, led by researchers including Yuwen Du and Siheng Chen, attributes the gains to data quality over compute or training complexity. Their synthesis pipeline makes three modifications: they scale knowledge graph size to force multi-hop exploration during data generation, expand the agent's available tool set to cover broader scenarios, and apply strict low-step filtering to eliminate trajectories that solve tasks via shortcuts rather than reasoning. The result is a training set where every example is both informative and high-difficulty.
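The shortcut filter is the most mechanical of the three changes and is easy to picture in code. A minimal sketch, assuming trajectories are stored as lists of tool-call steps; the threshold and field names here are illustrative, not the paper's published rule:

```python
from dataclasses import dataclass, field

@dataclass
class Trajectory:
    question: str
    answer: str
    steps: list[dict] = field(default_factory=list)  # one tool call + observation per step

def drop_shortcut_trajectories(trajectories: list[Trajectory], min_steps: int = 5) -> list[Trajectory]:
    """Keep only trajectories that needed a non-trivial number of tool calls.

    The (assumed) rationale: a trajectory that reaches the answer in one or
    two steps likely got there via a lucky search hit rather than multi-hop
    reasoning, so it adds little training signal and is filtered out.
    """
    return [t for t in trajectories if len(t.steps) >= min_steps]
```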

For enterprise architects evaluating agentic search stacks, the implications are direct. First, the model weights are open-sourced, so organizations can fine-tune and self-host a benchmark-leading search agent without negotiating API access or routing queries to proprietary clouds. Second, the SFT-only training path is reproducible on academic hardware, as sketched below; teams do not need the RL infrastructure now treated as essential for frontier agent development. Third, the 10.6k-example dataset size suggests that proprietary internal data (enterprise knowledge graphs, tool catalogs, internal documentation) could feed the same synthesis recipe in place of public sources, yielding agents tuned to specific operational domains.
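On the second point, the SFT-only path maps onto standard open-source tooling. A minimal sketch using Hugging Face TRL, where the dataset path, base model, and hyperparameters are illustrative assumptions rather than the team's published configuration:

```python
from datasets import load_dataset
from trl import SFTConfig, SFTTrainer

# Assumed: ~10k curated agent trajectories in chat "messages" format
# (system/user/assistant turns with tool calls and observations).
dataset = load_dataset("json", data_files="trajectories.jsonl", split="train")

config = SFTConfig(
    output_dir="search-agent-sft",
    num_train_epochs=3,                 # assumed schedule, not the published one
    per_device_train_batch_size=1,
    gradient_accumulation_steps=16,
    learning_rate=1e-5,
    bf16=True,
)

trainer = SFTTrainer(
    model="Qwen/Qwen3-30B-A3B",         # assumed 30B-class base; the paper's base may differ
    args=config,
    train_dataset=dataset,
)
trainer.train()
```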

The cost delta compounds at scale. Running a 30B model on-premises is infrastructure-intensive but firmly within the envelope of hardware large enterprises already deploy for inference workloads. The alternative, routing agentic search queries through a frontier proprietary API, carries per-query costs and data-egress exposure that mount quickly. OpenSeeker-v2 reframes the build-vs-buy calculus: a capability threshold that once required industrial training resources is now achievable with curated data and a single-stage training run.
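A rough way to frame the build-vs-buy question is a break-even query volume: the monthly volume at which a fixed self-hosting cost undercuts per-query API spend. The numbers below are placeholders for illustration only, not measured costs:

```python
def breakeven_queries_per_month(api_cost_per_query: float, monthly_serving_cost: float) -> float:
    """Monthly query volume above which a fixed self-hosted serving cost
    beats per-query API pricing (ignores engineering and ops overhead)."""
    return monthly_serving_cost / api_cost_per_query

# Hypothetical figures: $0.50 per agentic search session via API,
# $4,500/month for a dedicated inference node serving the 30B model.
print(breakeven_queries_per_month(0.50, 4500.0))  # -> 9000.0 queries/month
```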

Several caveats apply. The benchmarks cover search and retrieval-heavy reasoning; generalization to broader agentic tasks (code execution, long-horizon planning, tool use beyond web search) is not demonstrated. The 30B parameter count demands meaningful serving infrastructure. BrowseComp, a headline benchmark, is recent; its adversarial properties and correlation with real-world search quality remain under community scrutiny.

A purely academic team without access to the continual pre-training or reinforcement learning stages that define the industry recipe has produced a model that beats the recipe at its own benchmarks. The bottleneck was never the algorithm — it was the data.
