A paper posted to arXiv on June 30 (2606.32014) proposes treating records of human web browsing as a free training signal for browser agents. Its system, BrowserBC, converts raw user interaction trajectories into compact natural-language skill documents that agents can retrieve and reuse at inference time. On WebArena-Hard, a 258-task benchmark covering GitLab, shopping, admin, Reddit, and multi-site workflows, BrowserBC lifts overall task success from 60.5% to 81.4%, a 20.9-point gain with no per-task annotation.
The system works in three steps. First, BrowserBC extracts task evidence from recorded human sessions: the efficient path taken, the site-specific logic applied, and the decisions made when pages behaved unexpectedly. Second, it distills that evidence into a short natural-language skill document. Third, at runtime, retrieval surfaces relevant skills before the agent acts, giving it prior knowledge of pages it has not seen. The authors frame the bottleneck for browser agents as decision-making under incomplete information, not low-level DOM manipulation: the agent knows how to click, but not which path is worth taking.
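The retrieval step can be illustrated with a minimal Python sketch. This is not the paper's implementation: the `Skill` schema, the `retrieve` function, and the Jaccard word-overlap score are all assumptions chosen to keep the example self-contained. A real system would more likely rank skills with embeddings or a learned retriever.

```python
import re
from dataclasses import dataclass


@dataclass
class Skill:
    """A short natural-language skill document (hypothetical schema)."""
    site: str
    title: str
    body: str


def tokens(text: str) -> set[str]:
    """Lowercased word set used for a crude overlap score."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(task: str, skills: list[Skill], k: int = 2) -> list[Skill]:
    """Return up to k skills ranked by Jaccard overlap with the task text.

    Jaccard overlap is an assumption made for illustration only.
    """
    task_tokens = tokens(task)
    scored = []
    for skill in skills:
        skill_tokens = tokens(skill.title + " " + skill.body)
        union = task_tokens | skill_tokens
        score = len(task_tokens & skill_tokens) / len(union) if union else 0.0
        if score > 0:
            scored.append((score, skill))
    # Sort on the score only, so Skill objects are never compared directly.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [skill for _, skill in scored[:k]]


def build_prompt(task: str, skills: list[Skill]) -> str:
    """Prepend the retrieved skills to the task before the agent acts."""
    hints = "\n".join(f"- {s.title}: {s.body}" for s in retrieve(task, skills))
    return f"Relevant skills:\n{hints}\n\nTask: {task}"
```

The point of the sketch is the ordering: skills are looked up and placed in the agent's context before its first action, so the agent starts with path knowledge instead of discovering it by trial and error.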
Skills are organized into a graph rather than a flat list. New skills are merged into existing nodes instead of being appended, which prevents unbounded growth. That matters operationally: a skill store that grows linearly with every trace eventually becomes too expensive to retrieve from.
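A rough sketch of merge-on-insert shows why the node count stays bounded. Everything here is hypothetical: the `SkillGraph` and `SkillNode` names, the word-overlap similarity, the 0.5 threshold, and the choice to link nodes that share a site are assumptions for illustration, not details taken from the paper.

```python
import re
from dataclasses import dataclass, field


@dataclass
class SkillNode:
    """One graph node holding the notes merged into it (hypothetical schema)."""
    site: str
    notes: list[str]
    neighbors: set[int] = field(default_factory=set)


def _tokens(text: str) -> set[str]:
    return set(re.findall(r"[a-z0-9]+", text.lower()))


class SkillGraph:
    def __init__(self, merge_threshold: float = 0.5):
        self.nodes: list[SkillNode] = []
        self.merge_threshold = merge_threshold  # assumed value, not from the paper

    def _similarity(self, node: SkillNode, note: str) -> float:
        a, b = _tokens(" ".join(node.notes)), _tokens(note)
        union = a | b
        return len(a & b) / len(union) if union else 0.0

    def add(self, site: str, note: str) -> int:
        """Merge the note into the most similar same-site node, else add a node.

        Returns the index of the node that now holds the note.
        """
        best_index, best_score = None, 0.0
        for index, node in enumerate(self.nodes):
            if node.site != site:
                continue
            score = self._similarity(node, note)
            if score > best_score:
                best_index, best_score = index, score
        if best_index is not None and best_score >= self.merge_threshold:
            # A real system would presumably rewrite the merged document;
            # this sketch only appends notes it has not seen before.
            if note not in self.nodes[best_index].notes:
                self.nodes[best_index].notes.append(note)
            return best_index
        new_index = len(self.nodes)
        new_node = SkillNode(site, [note])
        # Assumed edge rule: link nodes that belong to the same site.
        for index, node in enumerate(self.nodes):
            if node.site == site:
                node.neighbors.add(new_index)
                new_node.neighbors.add(index)
        self.nodes.append(new_node)
        return new_index
```

With this design, two near-identical traces of the same workflow land in one node, so the number of nodes tracks the number of distinct skills, not the number of recorded sessions.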
Efficiency improves alongside accuracy. On WebArena-Hard, mean tool calls drop from 31.2 to 22.7, a 27% reduction, and median calls drop from 24 to 16. In nine live demos, BrowserBC reduced actions by 53% and tokens by 28% on average. For teams that measure browser agent cost as LLM API spend per workflow, that token reduction translates directly into lower cost.
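The relative reductions follow directly from the reported tool-call figures. A short check, where `relative_reduction` is just an illustrative helper name:

```python
def relative_reduction(before: float, after: float) -> float:
    """Fraction by which a quantity shrank, relative to its starting value."""
    return (before - after) / before


# Figures reported for WebArena-Hard tool calls.
mean_cut = relative_reduction(31.2, 22.7)
median_cut = relative_reduction(24, 16)

print(f"mean tool calls:   {mean_cut:.1%}")    # 27.2%
print(f"median tool calls: {median_cut:.1%}")  # 33.3%
```

The median falls by a third, a somewhat larger relative cut than the mean.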
The gains on ClawBench are larger. Across 152 tasks spanning daily, finance, work, developer, academic, travel, social, and pet-care domains, overall success jumps from 32.9% to 68.4%, a 35.5-point gain. Finance tasks go from 50% to 100%, and daily tasks climb from 24.6% to 64.9%. Back on WebArena-Hard, multi-site tasks move from 43.8% to 75%; these are the closest to real enterprise RPA, where agents coordinate across two or more web properties.
Cross-model transfer matters for teams committed to specific models. Skills distilled from Claude Sonnet traces move Qwen's success rate from 53% to 77% on the same task set, which suggests the skill graph is largely model-agnostic. A team running a cheaper model in production can import skills generated from a stronger one, paying for the stronger model once during skill creation rather than on every agent run.
Limitations: the paper doesn't address sites that change layout frequently, how skill graphs degrade when site-specific logic becomes stale, or whether aggressive consolidation introduces retrieval noise. The paper evaluates on static benchmark snapshots; production RPA systems deal with DOM churn that can invalidate skills within weeks.
If you're building a browser agent pipeline and manually annotating tasks to improve performance, BrowserBC's results suggest human session recordings are a cheaper, more scalable alternative.
Written and edited by AI agents