OpenSeeker-v2, a 30B open-source search agent built by a research team, outperforms Tongyi DeepResearch — Alibaba's proprietary model trained with full continual pre-training, supervised fine-tuning, and reinforcement learning — across all four agentic search benchmarks, using only supervised fine-tuning on 10,600 curated examples.

OpenSeeker-v2 scores 46.0% on BrowseComp, 58.1% on BrowseComp-ZH, 34.6% on Humanity's Last Exam, and 78.0% on xbench, versus Tongyi's 43.4%, 46.7%, 32.9%, and 75.0%. The margin holds across English and Chinese retrieval tasks and a factual-reasoning benchmark notorious for demanding genuine reasoning chains.

FIG. 02 OpenSeeker-v2 benchmark scores vs Tongyi DeepResearch. OpenSeeker-v2 surpasses Tongyi on BrowseComp (46.0% vs 43.4%) and scores highest on composite reasoning tests. (Source: OpenSeeker-v2, arXiv 2605.04036)

The team, led by researchers including Yuwen Du and Siheng Chen, attributes the gains to data quality over compute or training complexity. Their synthesis pipeline makes three modifications: they scale knowledge graph size to force multi-hop exploration during data generation, expand the agent's available tool set to cover broader scenarios, and apply strict low-step filtering to eliminate trajectories that solve tasks via shortcuts rather than reasoning. The result is a training set where every example is both informative and high-difficulty.
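The shortcut filter is the most mechanical of the three changes and is easy to picture in code. A minimal sketch, assuming trajectories are stored as lists of tool-call steps; the threshold and field names here are illustrative, not the paper's published rule:

```python
from dataclasses import dataclass, field

@dataclass
class Trajectory:
    question: str
    answer: str
    steps: list[dict] = field(default_factory=list)  # one tool call + observation per step

def drop_shortcut_trajectories(trajectories: list[Trajectory], min_steps: int = 5) -> list[Trajectory]:
    """Keep only trajectories that needed a non-trivial number of tool calls.

    The (assumed) rationale: a trajectory that reaches the answer in one or
    two steps likely got there via a lucky search hit rather than multi-hop
    reasoning, so it adds little training signal and is filtered out.
    """
    return [t for t in trajectories if len(t.steps) >= min_steps]
```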

For enterprise architects evaluating agentic search stacks, the implications are direct. First, the model weights are open-sourced, so organizations can fine-tune and self-host a benchmark-leading search agent without negotiating API access or routing queries to proprietary clouds. Second, the SFT-only training path is reproducible on academic hardware, as sketched below; teams do not need the RL infrastructure now treated as essential for frontier agent development. Third, the 10.6k-example dataset size suggests that proprietary internal data (enterprise knowledge graphs, tool catalogs, internal documentation) could feed the same synthesis recipe in place of public sources, yielding agents tuned to specific operational domains.
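On the second point, the SFT-only path maps onto standard open-source tooling. A minimal sketch using Hugging Face TRL, where the dataset path, base model, and hyperparameters are illustrative assumptions rather than the team's published configuration:

```python
from datasets import load_dataset
from trl import SFTConfig, SFTTrainer

# Assumed: ~10k curated agent trajectories in chat "messages" format
# (system/user/assistant turns with tool calls and observations).
dataset = load_dataset("json", data_files="trajectories.jsonl", split="train")

config = SFTConfig(
    output_dir="search-agent-sft",
    num_train_epochs=3,                 # assumed schedule, not the published one
    per_device_train_batch_size=1,
    gradient_accumulation_steps=16,
    learning_rate=1e-5,
    bf16=True,
)

trainer = SFTTrainer(
    model="Qwen/Qwen3-30B-A3B",         # assumed 30B-class base; the paper's base may differ
    args=config,
    train_dataset=dataset,
)
trainer.train()
```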

The cost delta compounds at scale. Running a 30B model on-premises is infrastructure-intensive but firmly within the envelope of hardware large enterprises already deploy for inference workloads. The alternative, routing agentic search queries through a frontier proprietary API, carries per-query costs and data-egress exposure that mount quickly. OpenSeeker-v2 reframes the build-vs-buy calculus: a capability threshold that once required industrial training resources is now achievable with curated data and a single-stage training run.
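A rough way to frame the build-vs-buy question is a break-even query volume: the monthly volume at which a fixed self-hosting cost undercuts per-query API spend. The numbers below are placeholders for illustration only, not measured costs:

```python
def breakeven_queries_per_month(api_cost_per_query: float, monthly_serving_cost: float) -> float:
    """Monthly query volume above which a fixed self-hosted serving cost
    beats per-query API pricing (ignores engineering and ops overhead)."""
    return monthly_serving_cost / api_cost_per_query

# Hypothetical figures: $0.50 per agentic search session via API,
# $4,500/month for a dedicated inference node serving the 30B model.
print(breakeven_queries_per_month(0.50, 4500.0))  # -> 9000.0 queries/month
```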

Several caveats apply. The benchmarks cover search and retrieval-heavy reasoning; generalization to broader agentic tasks (code execution, long-horizon planning, tool use beyond web search) is not demonstrated. The 30B parameter count demands meaningful serving infrastructure. BrowseComp, a headline benchmark, is recent; its adversarial properties and correlation with real-world search quality remain under community scrutiny.

A purely academic team without access to the continual pre-training or reinforcement learning stages that define the industry recipe has produced a model that beats the recipe at its own benchmarks. The bottleneck was never the algorithm — it was the data.
