EnvFactory lifts Qwen3 tool-calling accuracy by up to 15% with synthetic data

A team from the LARK lab at HKUST (GZ) and Huawei Technologies released EnvFactory, a fully automated pipeline that synthesizes stateful, executable tool environments and RL training trajectories without relying on real-world APIs or LLM-simulated backends. Fine-tuning Qwen3-series models on EnvFactory-generated data yields gains of up to 15% on BFCLv3 (Berkeley Function Calling Leaderboard v3) and 8.6% on MCP-Atlas. The framework generates 2,575 SFT and RL trajectories from 85 verified environments across 7 domains, where prior work often uses about five times as many environments.

Production APIs are costly to scale, and their network latency can destabilize RL training loops. LLM-based simulators hallucinate tool responses, which makes policies trained on them hard to generalize to real-world use. Existing synthetic approaches produce stateless, single-turn environments or depend on pre-scraped documentation, limiting diversity. Prior frameworks also over-specify trajectories, generating instruction sequences rather than natural user intents, which reduces their effectiveness for RL training.

EnvFactory's pipeline runs in two stages. First, it autonomously proposes tool-use scenarios and explores online resources to construct environment schemas: API structures, database state definitions, and multi-tool interaction graphs. Each proposed environment is verified in a sandboxed executor before it enters the training corpus, so only environments that execute without errors are kept. Second, the framework generates multi-turn trajectories using topology-aware sampling over a tool-dependency graph. A calibration step strips over-specification and injects the kind of implicit, contextually ambiguous phrasing that real users send. The output is a set of database-backed, executable environments with verified state transitions rather than probabilistic LLM outputs.
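To make "database-backed environment with verified state transitions" concrete, here is a minimal sketch, not EnvFactory's actual implementation. The orders table, the three tool functions, and `verify_env` are all hypothetical names chosen for illustration; the only assumption is that a tool is a function that reads or mutates real database state, and that verification means replaying a scripted call sequence and checking the state after each call.

```python
import sqlite3

def make_env():
    """Create a toy stateful environment: an in-memory orders database."""
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE orders (id INTEGER PRIMARY KEY, item TEXT, status TEXT)"
    )
    return db

# Hypothetical tools: each one reads or mutates the database state.
def create_order(db, item):
    cur = db.execute(
        "INSERT INTO orders (item, status) VALUES (?, 'open')", (item,)
    )
    db.commit()
    return {"order_id": cur.lastrowid}

def cancel_order(db, order_id):
    cur = db.execute(
        "UPDATE orders SET status = 'cancelled' WHERE id = ? AND status = 'open'",
        (order_id,),
    )
    db.commit()
    return {"cancelled": cur.rowcount == 1}

def get_status(db, order_id):
    row = db.execute(
        "SELECT status FROM orders WHERE id = ?", (order_id,)
    ).fetchone()
    return {"status": row[0] if row else None}

def verify_env():
    """Replay a scripted tool sequence and check every state transition.

    Returns True only if each call runs without raising and leaves the
    database in the expected state.
    """
    db = make_env()
    try:
        oid = create_order(db, "keyboard")["order_id"]
        if get_status(db, oid)["status"] != "open":
            return False
        if not cancel_order(db, oid)["cancelled"]:
            return False
        if get_status(db, oid)["status"] != "cancelled":
            return False
        # A second cancel must be rejected because the state already changed.
        if cancel_order(db, oid)["cancelled"]:
            return False
        return True
    except sqlite3.Error:
        return False
    finally:
        db.close()
```

Because the tool responses come from executing SQL against real state, the same call sequence always produces the same observations, which is the property an LLM-simulated backend cannot offer.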

FIG. 02 EnvFactory's two-stage pipeline: autonomous proposal → online exploration & verification → trajectory collection.
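The article does not spell out how topology-aware sampling works, so the following is only one plausible reading, sketched under stated assumptions: tools form a dependency graph, and a sampled tool chain must call every prerequisite before the tool that depends on it. The graph contents and the function names `sample_tool_chain` and `is_valid_chain` are hypothetical.

```python
import random

# Hypothetical tool-dependency graph: each tool maps to the tools whose
# output it needs before it can be called.
DEPENDS_ON = {
    "search_flights": [],
    "get_user_profile": [],
    "book_flight": ["search_flights", "get_user_profile"],
    "add_baggage": ["book_flight"],
    "cancel_booking": ["book_flight"],
}

def sample_tool_chain(depends_on, length, rng):
    """Sample an ordered tool chain where prerequisites precede dependents."""
    chain = []
    while len(chain) < length:
        # A tool is "ready" once all of its prerequisites are in the chain.
        ready = sorted(
            tool
            for tool, deps in depends_on.items()
            if tool not in chain and all(d in chain for d in deps)
        )
        if not ready:
            break  # every tool has been used
        chain.append(rng.choice(ready))
    return chain

def is_valid_chain(depends_on, chain):
    """Check that no tool appears before one of its prerequisites."""
    seen = set()
    for tool in chain:
        if any(d not in seen for d in depends_on[tool]):
            return False
        seen.add(tool)
    return True
```

Passing a seeded generator, for example `sample_tool_chain(DEPENDS_ON, 4, random.Random(0))`, keeps sampling reproducible. Each sampled chain can then be executed in the environment to produce a trajectory whose tool order is always executable, while uniform sampling over tools would often request `book_flight` before any search has run.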

Fine-tuning on EnvFactory-generated data also yields gains of up to 6% on conversational benchmarks including τ²-Bench and VitaBench, which measure policy-constrained multi-turn dialogue rather than one-shot function matching. These benchmarks are explicitly out-of-distribution relative to the 7 training domains, making the generalization result meaningful.

FIG. 03 Fine-tuning accuracy gains: up to +15% on tool-calling (BFCLv3), +8.6% on MCP-Atlas, +6% on conversational benchmarks. Source: HKUST LARK lab / Huawei

No inference cost, GPU-hours, per-trajectory generation time, or Qwen3 model sizes are disclosed. RL algorithm details, batch sizes, and training compute are absent from publicly available sections. This is a research release, not a production post-mortem.

The 7 training domains aren't enumerated, so generalization boundaries to new domains are uncharacterized. The framework doesn't address how it keeps online-resource snapshots current, which matters for avoiding schema drift in live API environments. The gap between a static database-backed sandbox and production systems with rate limits, authentication, and evolving schemas remains the actual integration risk. The paper doesn't report synthesis speed for new domains.

The trajectory calibration step is the most directly transferable idea. If your current synthetic trajectories read like structured prompts rather than user messages, the model being trained learns from the wrong signal. EnvFactory's calibration step, which strips over-specification and restores implicit intent, is a concrete technique to port.
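Before porting a calibration step, it helps to measure how over-specified an existing corpus is. The check below is a rough heuristic of our own, not EnvFactory's calibration method: it flags a user query that leaks tool names or reads as an ordered instruction list. The function name, the flag labels, and the marker list are all assumptions made for illustration.

```python
import re

# Words that suggest the query is a step-by-step script, not a user request.
STEP_MARKERS = re.compile(
    r"\b(first|then|next|finally|step \d+)\b", re.IGNORECASE
)

def overspecification_flags(query, tool_names):
    """Return heuristic flags for a synthetic user query.

    An empty list means no over-specification signal was found.
    """
    flags = []
    lowered = query.lower()
    # Real users rarely know or mention internal tool names.
    leaked = [name for name in tool_names if name.lower() in lowered]
    if leaked:
        flags.append("tool_name_leak:" + ",".join(leaked))
    # Two or more sequencing words suggest an instruction list.
    if len(STEP_MARKERS.findall(query)) >= 2:
        flags.append("step_by_step_phrasing")
    return flags
```

A query such as "First call search_flights for SFO to JFK, then call book_flight with the cheapest result." trips both flags, while "I need to get to New York on Friday, cheapest option please." trips neither. Queries that are flagged are candidates for rewriting into intent-level phrasing.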

Sources

EnvFactory uses 85 verified environments across 7 domains to generate 2,575 SFT and RL trajectories
"Using only 85 verified environments across 7 domains, EnvFactory generates 2,575 SFT and RL trajectories."
arxiv.org ↗
Fine-tuning Qwen3-series models yields up to +15% on BFCLv3 and +8.6% on MCP-Atlas
"improving Qwen3-series models by up to +15% on BFCLv3, +8.6% on MCP-Atlas, and +6% on conversational benchmarks including τ2-Bench and VitaBench"
arxiv.org ↗
EnvFactory achieves results using roughly five times fewer environments than prior work often uses
"Despite using significantly fewer environments than prior work, which are often 5 times more, EnvFactory achieves superior training efficiency and downstream performance"
arxiv.org ↗
LLM-based simulators are hallucination-prone, making RL training difficult to generalize
"Simulated environments use LLMs to emulate tool behavior, enabling rapid prototyping but often suffering from hallucination, which makes RL training difficult to generalize in real-world application"
arxiv.org ↗
Production APIs remain costly to scale and destabilize RL training due to network latency
"Production environments, such as real-world APIs or MCPs, provide authentic execution, but remain costly to scale and destabilize RL training due to potential network latency."
arxiv.org ↗
Existing synthetic trajectories are over-specified, resembling instruction sequences rather than natural human intents
"synthetic trajectories are frequently over-specified, resembling instruction sequences rather than natural human intents, reducing their effectiveness for RL training"
arxiv.org ↗
EnvFactory autonomously explores authentic online resources to build environment schemas and verifies them via sandboxed execution
"EnvFactory autonomously explores and verifies stateful, executable tool environments from authentic resources, and synthesizes natural multi-turn trajectories through topology-aware sampling and calibrated refinement, producing grounded queries with implicit intents."
arxiv.org ↗
The paper is from LARK lab at HKUST (GZ) with co-authors from Huawei Technologies
"Minrui Xu LARK, HKUST (GZ) ... Heyuan Deng Huawei Technologies Co., Ltd Fei Mi Huawei Technologies Co., Ltd Lifeng Shang Huawei Technologies Co., Ltd Xingshan Zeng Huawei Technologies Co., Ltd"
arxiv.org ↗
Fine-tuning yields +6% on conversational benchmarks including τ²-Bench and VitaBench
"improving Qwen3-series models by up to +15% on BFCLv3, +8.6% on MCP-Atlas, and +6% on conversational benchmarks including τ2-Bench and VitaBench"
arxiv.org ↗

Written and edited by AI agents · Methodology
