OpenSeeker-v2, a 30B open-source search agent built by a research team, outperforms Tongyi DeepResearch, Alibaba's agent trained with the full recipe of continual pre-training, supervised fine-tuning, and reinforcement learning, across all four agentic search benchmarks evaluated, using only supervised fine-tuning on 10,600 curated examples.
OpenSeeker-v2 scores 46.0% on BrowseComp, 58.1% on BrowseComp-ZH, 34.6% on Humanity's Last Exam, and 78.0% on xbench, versus Tongyi's 43.4%, 46.7%, 32.9%, and 75.0%. The margin holds across English and Chinese retrieval tasks and a factual-reasoning benchmark notorious for demanding genuine reasoning chains.
The team, led by researchers including Yuwen Du and Siheng Chen, attributes the gains to data quality over compute or training complexity. Their synthesis pipeline makes three modifications: they scale knowledge graph size to force multi-hop exploration during data generation, expand the agent's available tool set to cover broader scenarios, and apply strict low-step filtering to eliminate trajectories that solve tasks via shortcuts rather than reasoning. The result is a training set where every example is both informative and high-difficulty.
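Of the three changes, the low-step filter is the most mechanical and the easiest to sketch. The snippet below is a minimal illustration of that filtering step, assuming trajectories stored as dictionaries with a list of turns and a correctness flag; the field names and the step threshold are hypothetical, not the team's published schema or cutoff.

```python
# Minimal sketch of a low-step trajectory filter. The field names ("turns",
# "type", "answer_correct") and MIN_TOOL_CALLS are illustrative assumptions,
# not the team's published schema or threshold.

MIN_TOOL_CALLS = 5  # hypothetical cutoff: very few tool calls suggests a shortcut


def keep_trajectory(traj: dict) -> bool:
    """Keep only correct trajectories that required genuine multi-step search."""
    tool_calls = [turn for turn in traj.get("turns", []) if turn.get("type") == "tool_call"]
    return traj.get("answer_correct", False) and len(tool_calls) >= MIN_TOOL_CALLS


def filter_dataset(trajectories: list[dict]) -> list[dict]:
    """Drop shortcut solutions so every retained example forces real exploration."""
    return [t for t in trajectories if keep_trajectory(t)]
```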
For enterprise architects evaluating agentic search stacks, the implications are direct. First, the model weights are open-sourced, so organizations can fine-tune and self-host a benchmark-leading search agent without negotiating API access or routing queries to proprietary clouds. Second, the SFT-only training path is reproducible on academic hardware; teams do not need the RL infrastructure now treated as essential for frontier agent development (a sketch of such a run appears below). Third, the 10.6k-example dataset suggests that proprietary internal data (enterprise knowledge graphs, tool catalogs, internal documentation) could take the place of the published pipeline's source material, yielding agents tuned to specific operational domains.
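To make the second point concrete, the sketch below shows what a single-stage SFT run over a curated trajectory file could look like using the open-source TRL library. The base-model path, file name, and hyperparameters are placeholders; nothing here is the team's published configuration.

```python
# Minimal SFT-only run over curated agent trajectories, assuming HuggingFace TRL
# and a JSONL file of chat-format examples. All names and hyperparameters below
# are placeholders, not the authors' published setup.
from datasets import load_dataset
from trl import SFTConfig, SFTTrainer

train_set = load_dataset("json", data_files="curated_trajectories.jsonl", split="train")

config = SFTConfig(
    output_dir="search-agent-sft",       # hypothetical output directory
    num_train_epochs=2,
    per_device_train_batch_size=1,       # large models need small per-device batches
    gradient_accumulation_steps=16,      # keep the effective batch size reasonable
    learning_rate=1e-5,
    bf16=True,
)

trainer = SFTTrainer(
    model="path/to/30b-base-model",      # placeholder base checkpoint
    args=config,
    train_dataset=train_set,
)
trainer.train()
```

A job of this shape, sharded across a few accelerators with off-the-shelf parallelism, is the kind of run the reproducibility claim refers to.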
The cost delta compounds at scale. Running a 30B model on-premises is infrastructure-intensive, but it sits firmly within the envelope of hardware that large enterprises already deploy for inference workloads. The alternative, routing agentic search queries through a frontier proprietary API, carries per-query costs and data-egress exposure that mount quickly. OpenSeeker-v2 reframes the build-vs-buy calculus: a capability level that once required industrial training resources is now reachable with curated data and a single-stage training run.
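A rough break-even calculation makes the compounding concrete. Every figure in the sketch below is a hypothetical placeholder; substitute real API pricing, amortized hardware cost, and query volume before drawing conclusions.

```python
# Back-of-the-envelope build-vs-buy comparison. Every number here is a
# hypothetical placeholder for illustration only.

api_cost_per_query = 0.25          # USD, assumed frontier-API cost per agentic search
selfhost_monthly_fixed = 8_000.0   # USD, assumed amortized GPU + ops cost per month
selfhost_cost_per_query = 0.01     # USD, assumed marginal inference cost per query


def monthly_cost(queries_per_month: int) -> tuple[float, float]:
    """Return (API cost, self-hosted cost) for a given monthly query volume."""
    api = queries_per_month * api_cost_per_query
    selfhost = selfhost_monthly_fixed + queries_per_month * selfhost_cost_per_query
    return api, selfhost


# Break-even volume: fixed self-hosting cost divided by the per-query saving.
break_even = selfhost_monthly_fixed / (api_cost_per_query - selfhost_cost_per_query)
print(f"Self-hosting pays off above ~{break_even:,.0f} queries per month")
```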
Several caveats apply. The benchmarks cover search and retrieval-heavy reasoning; generalization to broader agentic tasks (code execution, long-horizon planning, tool use beyond web search) has not been demonstrated. The 30B parameter count demands meaningful serving infrastructure. And BrowseComp, a headline benchmark, is recent; its adversarial properties and its correlation with real-world search quality remain under community scrutiny.
A purely academic team, working without the continual pre-training and reinforcement learning stages that define the industry recipe, has produced a model that beats that recipe on its own benchmarks. The bottleneck was never the algorithm; it was the data.