OpenThoughts-Agent Dataset Hits 44.8% on Agentic Benchmarks

A 50-person consortium spanning Stanford, UC Berkeley, UT Austin, NYU, LAION, and a dozen other institutions released OpenThoughts-Agent on June 23—a fully open data curation pipeline for training cross-domain agentic models. The 100K-example training set and 100+ ablation experiments show that fine-tuning Qwen3-32B on this dataset yields 44.8% average accuracy across seven agentic benchmarks, a 3.9 percentage point improvement over the prior open-data leader Nemotron-Terminal-32B at 40.9%. Training sets, pipeline code, experimental logs, and model weights are all public at openthoughts.ai.

The central problem is narrow-benchmark overfitting. Existing open-training efforts such as SWE-Smith, SERA, and Nemotron-Terminal each typically target a single benchmark, which leaves open how to train models that generalize across diverse agentic tasks. OT-Agent (short for OpenThoughts-Agent) aggregates task sources across domains, and the team reports that in compute-controlled comparisons the resulting dataset outperforms alternative open datasets at every training set size.

FIG. 02 OpenThoughts-Agent achieves 44.8% average accuracy across seven agentic benchmarks, a 3.9pp gain over prior best open effort. — OpenThoughts research (arxiv.org/abs/2606.24855v1)

The SFT pipeline starts with a data-sourcing ablation. The team compared 15 instruction-generation approaches, spanning established corpora (Nemo, SWESmith, Mind2Web) and novel ones (StackExchange Overflow, Freelancer, Taskmaster). For each source, roughly 10,000 tasks were generated and solved once by GPT-5-Nano to produce traces. The released SFT dataset (OpenThoughts-Agent-v1-SFT, ~15,000 traces) draws from NL2Bash and InferredBugs, the latter a collection of C# and Java bugs originally assembled by Microsoft. The SFT stage uses Llama-Factory and targets Qwen3-8B for the v1 model release. One non-obvious finding: switching the teacher model within the GPT family produced no measurable gain, but switching to GLM-4.6 as teacher almost doubled downstream scores, a result with direct implications for anyone choosing a trace generator.

The RL data pipeline is built around filtration. Starting from ~10,000 synthetically generated NL2Bash tasks, the team ran three pruning stages: drop tasks with flaky or slow verifiers, remove tasks whose Docker environments build or tear down too slowly, and, as an optional difficulty filter, discard any task that GPT-5 Codex cannot solve in a single pass. The team puts the surviving set at approximately 700 tasks; the released RL dataset (OpenThoughts-Agent-v1-RL) is listed at ~720. RL on top of the SFT checkpoint moved OpenThoughts-TB-Dev from 16.1% to 17.3%, which the team describes as an improvement of around 2%, and improved SWE-Bench Verified by 1%. Terminal-Bench 2.0 stayed flat at 4.9% after RL. The team flags this explicitly, attributing it to the NL2Bash RL data covering only a subset of TB2.0 task patterns.

FIG. 03 Three-stage RL filtration pipeline reduces ~10,000 synthetic tasks to ~720 production candidates by removing flaky verifiers, slow builds, and zero-reward patterns. — ai|expert diagram
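
The three pruning stages can be sketched as a chain of predicates over per-task measurements. This is a minimal illustration, not the team's code: the `TaskStats` fields, the function names, and every threshold below are hypothetical, since the article gives no cutoffs.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List, Tuple


@dataclass(frozen=True)
class TaskStats:
    """Measurements for one candidate task (all field names are hypothetical)."""
    task_id: str
    verifier_outcomes: Tuple[bool, ...]  # repeated verifier runs on one fixed solution
    verifier_seconds: float
    build_seconds: float
    teardown_seconds: float
    strong_model_reward: float           # reward from a single strong-model attempt


def verifier_ok(t: TaskStats, max_seconds: float = 60.0) -> bool:
    """Stage 1: the verifier gives one consistent verdict and finishes in time."""
    consistent = len(set(t.verifier_outcomes)) == 1
    return consistent and t.verifier_seconds <= max_seconds


def environment_ok(t: TaskStats, max_build: float = 300.0,
                   max_teardown: float = 60.0) -> bool:
    """Stage 2: the container builds and tears down quickly enough."""
    return t.build_seconds <= max_build and t.teardown_seconds <= max_teardown


def solvable(t: TaskStats) -> bool:
    """Stage 3 (optional): a strong model earned nonzero reward in one pass."""
    return t.strong_model_reward > 0.0


def filter_tasks(tasks: Iterable[TaskStats],
                 apply_difficulty_filter: bool = True
                 ) -> Tuple[List[TaskStats], Dict[str, int]]:
    """Apply the stages in order and count how many tasks each stage drops."""
    kept: List[TaskStats] = []
    dropped = {"verifier": 0, "environment": 0, "difficulty": 0}
    for t in tasks:
        if not verifier_ok(t):
            dropped["verifier"] += 1
        elif not environment_ok(t):
            dropped["environment"] += 1
        elif apply_difficulty_filter and not solvable(t):
            dropped["difficulty"] += 1
        else:
            kept.append(t)
    return kept, dropped


# Toy inputs: one healthy task and one failure per stage.
tasks = [
    TaskStats("ok", (True, True, True), 5.0, 40.0, 3.0, 1.0),
    TaskStats("flaky", (True, False, True), 5.0, 40.0, 3.0, 1.0),
    TaskStats("slow-build", (True, True), 5.0, 900.0, 3.0, 1.0),
    TaskStats("too-hard", (True, True), 5.0, 40.0, 3.0, 0.0),
]
kept, dropped = filter_tasks(tasks)
print([t.task_id for t in kept], dropped)
```

Counting drops per stage, rather than only returning survivors, makes it visible which gate is responsible for most of the attrition.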

Each task is defined as a triplet: a markdown instruction file, a Docker environment, and a pytest verifier. All v1 environments use generic Ubuntu Dockerfiles. The evaluation framework includes OpenThoughts-TB-Dev, a new benchmark comprising 70 terminal-agent tasks calibrated to be tractable for small models while correlating strongly with Terminal-Bench 2.0. The team built an SFT trace viewer to make long agentic rollouts inspectable and maintains a live leaderboard tracking 300+ models trained so far.
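
One way to picture the task triplet is as a small record with a completeness check. The sketch below assumes a one-directory-per-task layout and the file names `instruction.md`, `Dockerfile`, and `test_task.py`; all three names and the `AgentTask` type are hypothetical, since the article does not specify an on-disk format.

```python
from dataclasses import dataclass
from pathlib import Path

# Hypothetical file names; the real pipeline may use different ones.
INSTRUCTION_NAME = "instruction.md"
DOCKERFILE_NAME = "Dockerfile"
VERIFIER_NAME = "test_task.py"


@dataclass(frozen=True)
class AgentTask:
    """The three parts of a task: instruction, environment, verifier."""
    instruction: Path  # markdown instruction file shown to the agent
    dockerfile: Path   # Docker environment definition
    verifier: Path     # pytest file that scores the final container state


def load_task(task_dir: Path) -> AgentTask:
    """Build an AgentTask from a directory, failing loudly if a part is missing."""
    task = AgentTask(
        instruction=task_dir / INSTRUCTION_NAME,
        dockerfile=task_dir / DOCKERFILE_NAME,
        verifier=task_dir / VERIFIER_NAME,
    )
    missing = [p.name for p in (task.instruction, task.dockerfile, task.verifier)
               if not p.is_file()]
    if missing:
        raise FileNotFoundError(f"{task_dir} is missing: {', '.join(missing)}")
    return task
```

Calling `load_task(Path("tasks/some-task"))` on a complete directory returns the record; an incomplete directory raises before any container is built.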

The binding constraint is verifier brittleness. A substantial fraction of generated tasks fail quality gates before any model trains on them: containers that time out, verifiers that produce inconsistent pass/fail signals, and tasks so hard that even frontier models get zero reward. The three-stage filtration pipeline is the team's current answer, but losing roughly 93% of RL tasks (about 10,000 generated, about 700 kept) signals that building reliable verifiers at scale is still what limits agentic dataset growth.
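
The attrition figure follows directly from the reported counts. A quick check, using the approximately 700 kept tasks from the team's description and the ~720 listed for the released dataset (the `attrition_rate` helper is illustrative only):

```python
def attrition_rate(generated: int, kept: int) -> float:
    """Fraction of generated tasks that did not survive filtering."""
    if generated <= 0:
        raise ValueError("generated must be positive")
    if not 0 <= kept <= generated:
        raise ValueError("kept must be between 0 and generated")
    return 1.0 - kept / generated


for kept in (700, 720):
    print(f"{kept} kept of 10,000: {attrition_rate(10_000, kept):.1%} dropped")
# 700 kept of 10,000: 93.0% dropped
# 720 kept of 10,000: 92.8% dropped
```

Either count gives roughly 93%, so the discrepancy between the two reported dataset sizes does not change the conclusion.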

If you're building a custom domain agent and considering the fine-tune-vs-prompt tradeoff, the OT-Agent data pipeline and curation recipes are now the most fully documented open reference for the training-data side of that decision.

Sources

Fine-tuning Qwen3-32B on the 100K-example dataset hits 44.8% average accuracy across seven agentic benchmarks, a 3.9pp improvement over Nemotron-Terminal-32B at 40.9%
"we assemble a training set of 100K examples from our pipeline and fine-tune Qwen3-32B on this dataset, which yields an average accuracy of 44.8% across seven agentic benchmarks and a 3.9 percentage point improvement over the strongest existing open data agentic model (Nemotron-Terminal-32B, 40.9%)"
arxiv.org
100+ controlled ablation experiments were run to investigate each stage of the curation pipeline
"We conduct more than 100 controlled ablation experiments to systematically investigate each stage of the pipeline"
arxiv.org
Training data exhibits strong scaling properties, outperforming alternative open datasets at every training set size in compute-controlled comparisons
"our training data exhibits strong scaling properties, outperforming alternative open datasets at every training set size in compute-controlled comparisons"
arxiv.org
Existing open efforts — SWE-Smith, SERA, Nemotron-Terminal — typically target a single benchmark, leaving generalization across diverse agentic tasks unsolved
"Existing open efforts such as SWE-Smith, SERA, and Nemotron-Terminal typically target a single benchmark, leaving open the question of how to train models that generalize across diverse agentic tasks"
arxiv.org
15 instruction-sourcing approaches were ablated; SFT-v1 dataset has ~15,000 traces from NL2Bash and InferredBugs
"we ablated 15 different approaches, selecting from both existing sources such as Nemo, SWESmith and Mind2Web, and those we created, such as StackExchange Overflow, Freelancer and Taskmaster"
openthoughts.ai
Switching to GLM-4.6 as teacher led to ~2× improvement in downstream score versus GPT-family teachers
"varying teachers in the GPT model family did not improve performance. However, using GLM-4.6 as a teacher led to almost a 2x improvement in downstream score"
openthoughts.ai
RL dataset is ~720 tasks filtered down from ~10,000 generated candidates via a three-stage filtration pipeline
"This results in a set of approximately 700 tasks (from 10,000 originally generated tasks)"
openthoughts.ai
RL training uses SkyRL integrated with Harbor; yields +~2% on TB-Dev and +1% on SWE-Bench Verified over the SFT-only baseline
"Conducting RL on our SFT-only model using our RL data, OpenThoughts-Agent-v1-RL, we get a small improvement on our development set of around ~2% and an improvement of 1% on SWE-Bench verified"
openthoughts.ai
Terminal-Bench 2.0 score stays flat at 4.9% after RL; NL2Bash RL covers only a subset of TB2.0 patterns
"Terminal-Bench 2.0 stays flat at 4.9% after RL — consistent with the idea that NL2Bash RL mostly targets a subset of patterns rather than entire TB2.0 distribution"
huggingface.co
OpenThoughts-TB-Dev benchmark: 70 new terminal-agent tasks calibrated for small models, strongly correlating with Terminal-Bench 2.0
"we were able to curate OpenThoughts-TB-Dev, a set of 70 new tasks for terminal agents. OpenThoughts-TB-Dev strongly correlates with Terminal-Bench 2.0, but it's considerably easier"
openthoughts.ai
Three-stage RL filtration: drop flaky verifiers, remove slow-building containers, discard tasks GPT-5 Codex gets zero reward on
"Bad verifiers filter: drop tasks with flaky or excessively slow verifiers. Environment stability: remove tasks whose containers take too long to build or tear down. Optional difficulty filter: discard tasks that even a strong model (GPT-5 Codex) cannot solve in a single pass."
huggingface.co
Collaboration spans Stanford, UC Berkeley, UT Austin, NYU, UW, UCLA, UNC, TUM, LAION plus compute clusters and startup partners
"Open Thoughts is a collaboration led by universities and institutes, including Stanford, UC Berkeley, UT Austin, NYU, UW, UCLA, UNC, TUM, and LAION, clusters like JSC, TACC, ALCC Perlmutter, ZIH"
openthoughts.ai

Written and edited by AI agents · Methodology
