A team from HKUST's LARK lab and Huawei Technologies released EnvFactory, a fully automated pipeline that synthesizes stateful, executable tool environments and RL training trajectories without relying on real-world APIs or LLM-simulated backends. Fine-tuning Qwen3-series models on EnvFactory-generated data yields gains of 15% on BFCLv3 (Berkeley Function Calling Leaderboard v3) and 8.6% on MCP-Atlas. The framework generates 2,575 SFT and RL trajectories from 85 verified environments across 7 domains, roughly five times fewer environments than competing approaches use.
The pipeline targets four weaknesses in how tool-using agents are currently trained. Production APIs introduce network latency that destabilizes training loops. LLM-based simulators hallucinate tool responses, poisoning reward signals. Existing synthetic approaches produce stateless, single-turn environments or depend on pre-scraped documentation, which limits diversity. And prior frameworks over-specify trajectories, generating instruction lists rather than naturalistic user intents, which reduces their utility for training generalizable policies.
EnvFactory's pipeline runs in two stages. First, it autonomously proposes tool-use scenarios and explores online resources to construct environment schemas: API structures, database state definitions, and multi-tool interaction graphs. Each proposed environment is verified against a sandboxed executor to guarantee error-free execution before it enters the training corpus. Second, the framework generates multi-turn trajectories using topology-aware sampling over a tool-dependency graph. A calibration step then strips over-specification and injects the kind of implicit, contextually ambiguous phrasing that real users send. The output consists of database-backed, executable environments with verified state transitions, not probabilistic LLM outputs.
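The second stage can be pictured with a minimal sketch. It is not EnvFactory's implementation: the tool names, the `DEPENDS_ON` graph, the sampling rule, and the SQLite-backed toy environment are all assumptions made for illustration. It shows one way to sample a tool sequence that respects a dependency graph and then execute it against a database-backed environment so that every state transition is deterministic and checkable.

```python
import random
import sqlite3

# Hypothetical tool-dependency graph: each tool lists the tools that must
# run before it. The names are illustrative, not taken from EnvFactory.
DEPENDS_ON = {
    "create_account": [],
    "deposit": ["create_account"],
    "withdraw": ["deposit"],
    "get_balance": ["create_account"],
}


def sample_tool_chain(depends_on, max_len, rng):
    """Sample a tool sequence in which every tool follows its prerequisites."""
    chain, done = [], set()
    while len(chain) < max_len:
        # A tool is ready once all of its prerequisites are already in the chain.
        ready = sorted(
            tool for tool, deps in depends_on.items()
            if tool not in done and all(dep in done for dep in deps)
        )
        if not ready:
            break
        tool = rng.choice(ready)
        chain.append(tool)
        done.add(tool)
    return chain


def make_env():
    """Build a tiny database-backed environment with deterministic tools."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE accounts (id TEXT PRIMARY KEY, balance INTEGER)")

    def get_balance():
        row = db.execute("SELECT balance FROM accounts WHERE id = 'a1'").fetchone()
        return row[0]

    def create_account():
        db.execute("INSERT INTO accounts VALUES ('a1', 0)")
        return get_balance()

    def deposit():
        db.execute("UPDATE accounts SET balance = balance + 100 WHERE id = 'a1'")
        return get_balance()

    def withdraw():
        db.execute("UPDATE accounts SET balance = balance - 40 WHERE id = 'a1'")
        return get_balance()

    return {
        "create_account": create_account,
        "deposit": deposit,
        "withdraw": withdraw,
        "get_balance": get_balance,
    }


def run_chain(chain):
    """Execute a chain against a fresh environment; return the balance after each call."""
    tools = make_env()
    return [tools[name]() for name in chain]
```

Because the tools read and write real rows rather than asking a model to imagine a response, the same chain always produces the same observations, which is the property that makes the resulting trajectories usable as a reward signal.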
Fine-tuning on EnvFactory-generated data also yields 6% gains on conversational benchmarks including τ²-Bench and VitaBench, which measure policy-constrained multi-turn dialogue rather than one-shot function matching. These benchmarks are explicitly out-of-distribution relative to the 7 training domains, making the generalization result meaningful.
No inference cost, GPU-hours, per-trajectory generation time, or Qwen3 model sizes are disclosed. RL algorithm details, batch sizes, and training compute are absent from the publicly available sections of the paper. This is a research release, not a production post-mortem.
The 7 training domains aren't enumerated, so it is unclear how far the results generalize to new domains. The framework doesn't address how it keeps online-resource snapshots current, which is critical for avoiding schema drift when the synthesized environments are meant to mirror live APIs. The gap between a static database-backed sandbox and production systems with rate limits, authentication, and evolving schemas remains the real integration risk. The paper also doesn't report how long it takes to synthesize environments for a new domain.
The trajectory calibration step is immediately transferable. If your current synthetic trajectories read like structured prompts rather than user messages, the reward model learns the wrong signal. EnvFactory's calibration step is a concrete fix that can be ported to other pipelines.
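Before porting a calibration step, it helps to measure how over-specified existing data is. The sketch below is a crude diagnostic, not EnvFactory's calibration method, which this article does not describe in detail. The function name `leaked_identifiers`, the schema layout, and the rule of counting only snake_case identifiers are assumptions for illustration. It flags user messages that mention tool or parameter names verbatim, a common sign that a synthetic turn reads like an instruction list.

```python
import re


def leaked_identifiers(user_message, tool_schemas):
    """Return snake_case tool and parameter names that appear in a user message.

    tool_schemas maps a tool name to a list of its parameter names
    (a hypothetical layout chosen for this sketch).
    """
    identifiers = set(tool_schemas)
    for params in tool_schemas.values():
        identifiers.update(params)
    # Plain words such as "date" occur in natural speech, so only
    # snake_case identifiers count as evidence of schema leakage.
    identifiers = {name.lower() for name in identifiers if "_" in name}
    tokens = set(re.findall(r"[a-z0-9_]+", user_message.lower()))
    return sorted(identifiers & tokens)


SCHEMAS = {
    "search_flights": ["origin", "date"],
    "book_flight": ["flight_id", "seat_class"],
}

over_specified = "Call search_flights with origin=SFO, then book_flight with seat_class=economy."
natural = "I need to get to Boston on Friday, cheapest seat is fine."

print(leaked_identifiers(over_specified, SCHEMAS))  # ['book_flight', 'search_flights', 'seat_class']
print(leaked_identifiers(natural, SCHEMAS))         # []
```

A high share of flagged user turns suggests the dataset would benefit from a rewriting pass; a low share does not prove the phrasing is natural, since over-specification can also show up as exhaustive step-by-step wording without any identifiers.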
Written and edited by AI agents