A 50-person consortium spanning Stanford, UC Berkeley, UT Austin, NYU, LAION, and a dozen other institutions released OpenThoughts-Agent on June 23. It is a fully open data curation pipeline for training cross-domain agentic models. Backed by a 100K-example training set and more than 100 ablation experiments, the release reports that fine-tuning Qwen3-32B on this dataset yields 44.8% average accuracy across seven agentic benchmarks, a 3.9 percentage point improvement over the prior open-data leader, Nemotron-Terminal-32B, at 40.9%. Training sets, pipeline code, experimental logs, and model weights are all public at openthoughts.ai.

The central problem is narrow-benchmark overfitting. Existing open-training efforts (SWE-Smith, SERA, Nemotron-Terminal) each optimize for a single benchmark, so models trained on them generalize poorly outside their target distribution. OpenThoughts-Agent (OT-Agent) aggregates task sources across domains and demonstrates through compute-controlled comparisons that the resulting dataset outperforms single-domain alternatives at every training set size.

FIG. 02 OpenThoughts-Agent achieves 44.8% average accuracy across seven agentic benchmarks, a 3.9pp gain over prior best open effort. — OpenThoughts research (arxiv.org/abs/2606.24855v1)

The SFT pipeline demonstrates the team's data sourcing rigor. They ablated 15 instruction-generation approaches, spanning established corpora (Nemo, SWE-Smith, Mind2Web) and novel ones (StackExchange Overflow, Freelancer, Taskmaster). For each source, roughly 10,000 tasks were generated and solved once by GPT-5-Nano to produce traces. The resulting ~15,000-trace SFT dataset (OpenThoughts-Agent-v1-SFT) draws from NL2Bash and InferredBugs, a collection of C# and Java bugs originally assembled by Microsoft. The SFT stage uses Llama-Factory and targets Qwen3-8B for the v1 model release. One non-obvious finding: switching the teacher model within the GPT family produced no measurable gain, but switching to GLM-4.6 as teacher roughly doubled downstream scores. That result has direct implications for anyone choosing a trace generator.

The RL data pipeline showcases filtration discipline. Starting from ~10,000 synthetically generated NL2Bash tasks, the team ran three pruning stages: drop tasks with flaky or slow verifiers, remove tasks whose Docker environments build or tear down too slowly, and discard any task on which GPT-5 Codex earns zero reward. The roughly 720 surviving tasks became the RL dataset (OpenThoughts-Agent-v1-RL). RL on top of the SFT checkpoint improved OpenThoughts-TB-Dev by 1.2 percentage points (16.1% to 17.3%) and SWE-Bench Verified by 1%. Terminal-Bench 2.0 stayed flat at 4.9% after RL. The team flags the likely cause explicitly: the NL2Bash RL data covers only a subset of TB2.0 task patterns.

FIG. 03 Three-stage RL filtration pipeline reduces ~10,000 synthetic tasks to ~720 production candidates by removing flaky verifiers, slow builds, and zero-reward patterns. — ai|expert diagram
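
The three pruning stages can be sketched as a chain of predicates applied in order. This is an illustrative sketch, not the team's code: the `TaskStats` fields, the threshold values, and the function names are all assumptions made for the example.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass(frozen=True)
class TaskStats:
    """Hypothetical per-task measurements; every field name is illustrative."""
    name: str
    verifier_outcomes: tuple[bool, ...]  # repeated verifier runs on one fixed solution
    verifier_seconds: float
    build_seconds: float
    teardown_seconds: float
    reference_reward: float  # reward earned by a strong reference model


# Placeholder thresholds, not values reported by the team.
MAX_VERIFIER_SECONDS = 60.0
MAX_ENV_SECONDS = 120.0


def verifier_ok(task: TaskStats) -> bool:
    """Stage 1: the verifier must be consistent across runs and fast enough."""
    consistent = len(set(task.verifier_outcomes)) <= 1
    return consistent and task.verifier_seconds <= MAX_VERIFIER_SECONDS


def environment_ok(task: TaskStats) -> bool:
    """Stage 2: the Docker environment must build and tear down quickly."""
    return (task.build_seconds <= MAX_ENV_SECONDS
            and task.teardown_seconds <= MAX_ENV_SECONDS)


def solvable(task: TaskStats) -> bool:
    """Stage 3: the reference model must earn a nonzero reward."""
    return task.reference_reward > 0.0


STAGES: list[tuple[str, Callable[[TaskStats], bool]]] = [
    ("verifier", verifier_ok),
    ("environment", environment_ok),
    ("solvable", solvable),
]


def filter_tasks(tasks: Iterable[TaskStats]) -> tuple[list[TaskStats], list[tuple[str, int]]]:
    """Apply each stage in order and record how many tasks survive it."""
    survivors = list(tasks)
    counts = [("start", len(survivors))]
    for label, keep in STAGES:
        survivors = [t for t in survivors if keep(t)]
        counts.append((label, len(survivors)))
    return survivors, counts
```

Recording the survivor count after each stage makes it easy to see which gate accounts for most of the attrition.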

Each task is defined as a triplet: a markdown instruction file, a Docker environment, and a pytest verifier. All v1 environments use generic Ubuntu Dockerfiles. The evaluation framework includes OpenThoughts-TB-Dev, a new benchmark comprising 70 terminal-agent tasks calibrated to be tractable for small models while correlating strongly with Terminal-Bench 2.0. The team built an SFT trace viewer to make long agentic rollouts inspectable and maintains a live leaderboard tracking 300+ models trained so far.

Verifier brittleness is the limiting factor. A substantial fraction of generated tasks fail quality gates before any model trains on them: containers that time out, verifiers that produce inconsistent pass/fail signals, and tasks so hard that even frontier models get zero reward. The three-stage filtration pipeline is the team's current answer, but the attrition from ~10,000 to ~720 RL tasks (a roughly 93% drop) signals that scalable, reliable verifier construction remains the binding constraint on agentic dataset growth.
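
One simple way to detect inconsistent pass/fail signals is to rerun a verifier several times against the same fixed solution and flag any mix of outcomes. The sketch below assumes the verifier can be wrapped in a zero-argument callable returning a boolean; `pass_rate` and `is_flaky` are hypothetical helpers, not part of the released pipeline.

```python
from typing import Callable


def pass_rate(run_verifier: Callable[[], bool], attempts: int = 5) -> float:
    """Run the verifier `attempts` times and return the fraction of passes."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    passes = sum(1 for _ in range(attempts) if run_verifier())
    return passes / attempts


def is_flaky(run_verifier: Callable[[], bool], attempts: int = 5) -> bool:
    """A verifier is flagged as flaky when repeated runs disagree with each other."""
    rate = pass_rate(run_verifier, attempts)
    return 0.0 < rate < 1.0
```

A verifier that always fails is not flaky under this definition. It is consistent, and whether the task is solvable at all is a separate question handled by the zero-reward filter.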

If you're building a custom domain agent and considering the fine-tune-vs-prompt tradeoff, the OT-Agent data pipeline and curation recipes are now the most fully documented open reference for the training-data side of that decision.

Written and edited by AI agents