Stanford researchers have demonstrated a 10.4× latency speedup and 28-percentage-point accuracy gain over Browser-Use in computer-use agent tasks by replacing the standard fetch-screenshot-execute loop with a compiled code plan. They call this technique agent just-in-time (JIT) compilation.
Computer-use agents (CUAs) like Browser-Use and OpenAI's CUA operate by calling an LLM at each step: screenshot, reasoning, action, repeat. Every iteration incurs inference latency and creates a hallucination point. The Stanford approach instead compiles a task description into executable code at plan time. The compiled plan calls the LLM only where necessary and otherwise invokes tools and parallelizes work directly, with no LLM intermediary at every step.
The system has three components: JIT-Planner, JIT-Scheduler, and an invariant-enforcing tool protocol. JIT-Planner generates multiple candidate code plans in parallel, validates each via a control-flow graph (CFG) against tool specifications, and selects the minimum-cost candidate. Rather than primitive browser actions, the planner composes higher-level reusable tools (functions like list_restaurants or add_to_cart) and uses the CFG to statically verify preconditions and postconditions before execution. The spread between the best-latency and worst-latency candidate plans is 5.3×, indicating that the choice among candidates, by itself, has a large effect on latency. JIT-Scheduler then explores parallelization strategies via Monte Carlo estimation from learned latency distributions, asking whether tasks run faster serially, in parallel, or with speculative hedging. Finally, the tool protocol requires each tool to declare its state preconditions and postconditions, which enables compositional verification at compile time rather than error discovery at runtime.
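The scheduling question described above (serial, parallel, or hedged) can be illustrated with a minimal Monte Carlo sketch. This is not the paper's scheduler: the log-normal latency shape, the task parameters, the `extra_call_penalty` knob, and all function names here are assumptions made for illustration, and "hedged" is modeled simply as launching two copies of each task and keeping the faster one.

```python
import random
from statistics import mean

# Each task is (mu, sigma) of an ASSUMED log-normal latency distribution, in seconds.
# The paper's learned distributions are not reproduced here.
TASKS = [(0.0, 0.5), (0.5, 0.5)]

def draw(task, rng):
    """Sample one latency for a task."""
    mu, sigma = task
    return rng.lognormvariate(mu, sigma)

def serial(tasks, rng):
    # Tasks run one after another: latencies add up.
    return sum(draw(t, rng) for t in tasks)

def parallel(tasks, rng):
    # Tasks run concurrently: the slowest one determines completion time.
    return max(draw(t, rng) for t in tasks)

def hedged(tasks, rng):
    # Two copies of every task run concurrently; the faster copy wins.
    return max(min(draw(t, rng), draw(t, rng)) for t in tasks)

STRATEGIES = {"serial": serial, "parallel": parallel, "hedged": hedged}
# Extra calls issued per task by each strategy (hedging duplicates every call).
EXTRA_CALLS_PER_TASK = {"serial": 0, "parallel": 0, "hedged": 1}

def estimate(tasks, n_samples=5000, seed=0):
    """Monte Carlo estimate of mean completion time for each strategy."""
    rng = random.Random(seed)
    return {
        name: mean(run(tasks, rng) for _ in range(n_samples))
        for name, run in STRATEGIES.items()
    }

def choose(tasks, extra_call_penalty=0.0, n_samples=5000, seed=0):
    """Pick the strategy with the lowest estimated latency plus a
    hypothetical penalty (in seconds) for each duplicated call."""
    est = estimate(tasks, n_samples, seed)
    scores = {
        name: est[name] + extra_call_penalty * EXTRA_CALLS_PER_TASK[name] * len(tasks)
        for name in est
    }
    return min(scores, key=scores.get)

if __name__ == "__main__":
    print(estimate(TASKS))
    print(choose(TASKS), choose(TASKS, extra_call_penalty=10.0))
```

With no penalty, hedging has the lowest estimated latency in this toy setup because it takes the faster of two draws. Once duplicated calls are priced in, plain parallel execution can come out ahead, which is the kind of trade-off a scheduler of this sort has to weigh.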
Benchmark results across five unnamed web applications: against Browser-Use, JIT-Planner achieves a 10.4× speedup and a 28-percentage-point accuracy gain. Against OpenAI CUA, JIT-Scheduler achieves a 2.4× speedup and +9% accuracy. The accuracy gains are attributed to compile-time invariant checking, which catches tool-selection errors that the standard agent loop surfaces only after wasted inference.
The 10.4× speedup is measured against Browser-Use, which incurs an LLM call at most browser events. OpenAI CUA already includes internal optimizations, so JIT-Scheduler's marginal gain is 2.4×, still commercially significant at scale. The research was published May 20 on arXiv (2605.21470) and submitted to ICML.
Key limitations: no wall-clock latency numbers, token counts, cost-per-task, hardware specs, or production deployment data. Compilation itself requires inference to generate and validate candidate plans; the paper does not quantify how that upfront cost amortizes across different task frequencies. The invariant protocol requires tool authors to specify contracts, adding integration burden. Dynamic web environments can invalidate compiled plans mid-execution, and fallback or recompilation paths are not described.
The 5.3× plan-selection spread is the core pattern: any multi-step agent loop today can generate a handful of candidate plans and pick the minimum-cost one before execution without adopting the full JIT architecture.
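A minimal sketch of that pattern, under stated assumptions: candidate plans are plain lists of tool names, a simple precondition/postcondition walk stands in for the paper's CFG validation, and cost is a sum of per-tool latency estimates. The tool names list_restaurants and add_to_cart come from the article; `open_menu`, `search_dish`, every contract, and every latency figure below are hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Tool:
    name: str
    requires: frozenset  # state facts that must hold before the call
    provides: frozenset  # state facts that hold after the call
    est_latency: float   # assumed latency estimate in seconds

# Hypothetical contracts and latencies, for illustration only.
TOOLS = {
    t.name: t
    for t in [
        Tool("list_restaurants", frozenset(), frozenset({"restaurants_listed"}), 2.0),
        Tool("open_menu", frozenset({"restaurants_listed"}), frozenset({"menu_open"}), 1.5),
        Tool("search_dish", frozenset(), frozenset({"menu_open"}), 4.0),
        Tool("add_to_cart", frozenset({"menu_open"}), frozenset({"item_in_cart"}), 1.0),
    ]
}

def validate(plan, tools, initial_state=frozenset()):
    """Check, before running anything, that every step's preconditions
    are established by the initial state or by earlier steps."""
    state = set(initial_state)
    for step in plan:
        tool = tools.get(step)
        if tool is None or not tool.requires <= state:
            return False
        state |= tool.provides
    return True

def plan_cost(plan, tools):
    """Estimated latency of a straight-line plan: sum of its steps."""
    return sum(tools[step].est_latency for step in plan)

def select_plan(candidates, tools, initial_state=frozenset()):
    """Drop invalid candidates, then return the cheapest valid one (or None)."""
    valid = [p for p in candidates if validate(p, tools, initial_state)]
    if not valid:
        return None
    return min(valid, key=lambda p: plan_cost(p, tools))

CANDIDATES = [
    ["list_restaurants", "open_menu", "add_to_cart"],  # valid
    ["search_dish", "add_to_cart"],                    # valid, slower estimate
    ["list_restaurants", "add_to_cart"],               # cheapest, but menu never opened
]

if __name__ == "__main__":
    print(select_plan(CANDIDATES, TOOLS))
```

The third candidate has the lowest estimated cost but is rejected because add_to_cart's precondition is never established, which is why validation runs before cost comparison here. In a real agent the candidates would come from several LLM sampling calls, and the cost model would need to reflect measured latencies rather than fixed guesses.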
Written and edited by AI agents