Researchers from Yale NLP Lab, the University of Pennsylvania, and UNC Chapel Hill released OpenComputer, a verifier-grounded framework for building machine-checkable desktop evaluation environments for computer-use agents. The benchmark ships with 1,000 finalized tasks spanning 33 applications — browsers, office suites, creative software, development environments, file managers, and communication tools — and is open-sourced at github.com/echo0715/OpenComputer.
LLM-as-a-judge evaluation is inadequate for desktop agents. LLM judges are sensitive to prompt wording and incomplete observations, are difficult to audit across runs, and reward outcomes that look plausible in screenshots while missing errors buried in application state. OpenComputer replaces judge-based scoring with four tightly coupled components: app-specific state verifiers that expose structured inspection endpoints over real applications, a self-evolving verification layer that iteratively improves verifier reliability using execution-grounded feedback, a task-generation pipeline that synthesizes realistic, machine-checkable task instances, and an evaluation harness that records full agent trajectories and computes auditable partial-credit rewards.
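To make the idea of an auditable partial-credit reward concrete, here is a minimal sketch in Python. It is not OpenComputer's actual API: the `Check` class, the `partial_credit` function, the weights, and the shape of the `state` dictionary are all hypothetical, chosen only to illustrate scoring against structured application state instead of screenshots.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List

State = Dict[str, Any]


@dataclass
class Check:
    """One machine-checkable condition on application state (hypothetical type)."""
    name: str
    weight: float
    predicate: Callable[[State], bool]


def partial_credit(state: State, checks: List[Check]) -> Dict[str, Any]:
    """Return the weighted fraction of passed checks plus a per-check audit trail."""
    total = sum(c.weight for c in checks)
    if total <= 0:
        raise ValueError("checks must carry a positive total weight")
    passed: Dict[str, bool] = {}
    earned = 0.0
    for c in checks:
        try:
            ok = bool(c.predicate(state))
        except (KeyError, TypeError, IndexError):
            ok = False  # state that is missing or malformed counts as a failed check
        passed[c.name] = ok
        if ok:
            earned += c.weight
    return {
        "reward": earned / total,          # partial credit in [0, 1]
        "success": all(passed.values()),   # strict end-to-end completion
        "passed": passed,                  # audit trail, one entry per check
    }


# Assumed example: state as a spreadsheet inspection endpoint might report it.
state = {"sheet": {"B2": 42, "B3": None}, "file_saved": True}
checks = [
    Check("cell_B2_is_42", 2.0, lambda s: s["sheet"]["B2"] == 42),
    Check("cell_B3_filled", 1.0, lambda s: s["sheet"]["B3"] is not None),
    Check("file_saved", 1.0, lambda s: s["file_saved"] is True),
]
report = partial_credit(state, checks)
print(report)
```

In this toy run the agent earns credit for two of the three checks but does not count as an end-to-end success, and the `passed` map records exactly which condition failed, which is the kind of audit trail a screenshot-based judge cannot provide.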
The self-evolving verifier loop runs during calibration. In Phase 2 of the pipeline, a strong agent executes a set of calibration tasks, and an LLM evaluator is then pitted against the programmatic verifier on the results. Where the two disagree, the system attributes the discrepancy and writes the attribution back into verifier memory, patching the endpoint, the checker, or the documentation. Verifier reliability improves without hand-labeling new ground truth.
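The disagreement-driven loop described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the paper's implementation: `CalibrationRun`, `VerifierMemory`, and `calibrate` are hypothetical names, and the `attribute` callback stands in for whatever attribution step the real system uses.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

# The three patch targets named in the text.
PATCH_TARGETS = ("endpoint", "checker", "documentation")


@dataclass
class CalibrationRun:
    """Outcome of one calibration task executed by a strong agent (hypothetical)."""
    task_id: str
    verifier_pass: bool  # verdict of the programmatic verifier
    judge_pass: bool     # verdict of the LLM evaluator


@dataclass
class VerifierMemory:
    """Accumulates attributed discrepancies per patch target (hypothetical)."""
    notes: Dict[str, List[str]] = field(
        default_factory=lambda: {t: [] for t in PATCH_TARGETS}
    )


def calibrate(
    runs: List[CalibrationRun],
    attribute: Callable[[CalibrationRun], str],
    memory: VerifierMemory,
) -> int:
    """Record every verifier/judge disagreement in memory; return how many were found."""
    disagreements = 0
    for run in runs:
        if run.verifier_pass == run.judge_pass:
            continue  # agreement: nothing to attribute
        target = attribute(run)
        if target not in memory.notes:
            raise ValueError(f"unknown patch target: {target!r}")
        memory.notes[target].append(run.task_id)
        disagreements += 1
    return disagreements


# Assumed example data; the attribution labels are placeholders, not real results.
runs = [
    CalibrationRun("t1", verifier_pass=True, judge_pass=True),
    CalibrationRun("t2", verifier_pass=False, judge_pass=True),
    CalibrationRun("t3", verifier_pass=True, judge_pass=False),
]
labels = {"t2": "checker", "t3": "endpoint"}
memory = VerifierMemory()
count = calibrate(runs, lambda run: labels[run.task_id], memory)
print(count, memory.notes)
```

The point of the sketch is the control flow: only disagreements generate work, and each one is filed under a specific component to patch, so no new ground-truth labels are needed to decide where to look.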
Hard-coded verifiers showed closer alignment with human adjudication than LLM-as-judge scoring, especially on tasks where success depends on fine-grained application state rather than visible UI output. Frontier models struggled with end-to-end task completion despite accumulating partial credit — consistent with agent benchmarks generally, but OpenComputer's partial-credit reward structure makes the gap more visible than binary pass/fail scoring. Open-source models exhibited sharp score drops relative to their OSWorld-Verified numbers, suggesting the transfer from OSWorld's 369-task corpus to OpenComputer's 1,000-task, 33-app spread is non-trivial. The paper does not disclose specific pass rates per model or per application category.
OpenComputer is a research framework and evaluation harness, not a shipping inference product. No latency, cost-per-task, or GPU-hours-to-evaluate figures were disclosed. Teams adopting this framework must budget for maintaining live application state across 33 desktop apps: creating or editing files, configuring folders, populating spreadsheets, seeding email or calendar state, and ensuring reproducibility across VM snapshots. This mirrors the pain point OSWorld teams have flagged repeatedly. OpenComputer's task-generation pipeline aims to automate task synthesis, but the verifier maintenance burden shifts rather than disappears.
Applications ship updates; a state-inspection endpoint that worked on LibreOffice 24.x may silently fail on 25.x. The self-evolving layer addresses this in principle, but continuous re-validation is required as application versions change. The partial-credit reward structure will also matter for RL training pipelines: if teams intend to use OpenComputer as a training signal rather than just an eval harness, reward-shaping choices matter as much as verifier accuracy.
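One simple way to guard against the version drift mentioned above is to record the application version each verifier was last validated against and flag it when the installed major version changes. The sketch below is an assumption about how a team might do this, not part of OpenComputer; `ValidatedVerifier`, `stale_verifiers`, and the version strings are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class ValidatedVerifier:
    """A verifier plus the app version it was last validated on (hypothetical)."""
    app: str
    validated_version: str  # dotted version string, e.g. "24.8"


def major(version: str) -> int:
    """Extract the leading integer of a dotted version string."""
    return int(version.split(".")[0])


def stale_verifiers(
    verifiers: List[ValidatedVerifier], installed: Dict[str, str]
) -> List[str]:
    """Return apps whose verifier needs re-validation before its scores are trusted."""
    stale = []
    for v in verifiers:
        current = installed.get(v.app)
        # An unknown install or a changed major version both trigger re-validation.
        if current is None or major(current) != major(v.validated_version):
            stale.append(v.app)
    return stale


# Assumed example: illustrative version numbers only.
verifiers = [
    ValidatedVerifier("libreoffice", "24.8"),
    ValidatedVerifier("gimp", "2.10"),
]
installed = {"libreoffice": "25.2", "gimp": "2.10"}
stale = stale_verifiers(verifiers, installed)
print(stale)
```

Comparing only major versions is a deliberate simplification; a stricter policy could re-validate on any version change, at the cost of more frequent calibration runs.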
If you ship a computer-use agent and rely on LLM judges for eval, lift OpenComputer's verifier-grounded partial-credit pattern. Build state-inspection endpoints first, run the self-evolving calibration loop before deploying tasks at scale, and treat eval infrastructure like production code under version control.
Written and edited by AI agents