A paper posted to arXiv on June 30 (2606.32014) proposes treating human web browsing records as a free training signal for browser agents. BrowserBC converts raw user interaction trajectories into compact natural-language skill documents that agents can retrieve and reuse at inference time. On WebArena-Hard, a 258-task benchmark covering GitLab, shopping, admin, Reddit, and multi-site workflows, BrowserBC lifts overall task success from 60.5% to 81.4% — a 20.9-point gain with no per-task annotation.

The system works in three steps. First, BrowserBC extracts task evidence from recorded human sessions: the efficient path taken, the site-specific logic applied, and the decisions made when pages behaved unexpectedly. Second, it distills that evidence into a short natural-language skill document. Third, at runtime, retrieval surfaces relevant skills before the agent acts, giving it prior knowledge for pages it has not seen. The authors frame the bottleneck for browser agents as decision-making under incomplete information, not low-level DOM manipulation. The agent knows how to click; it doesn't know which path is worth taking.
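The retrieval step can be pictured with a minimal sketch. Everything in it is an assumption made for illustration: the `Skill` record, the sample skill texts, and the token-overlap scoring are hypothetical stand-ins, not the paper's retriever, which this article does not describe.

```python
import re
from dataclasses import dataclass


@dataclass
class Skill:
    """Hypothetical skill document: a title plus short natural-language guidance."""
    title: str
    body: str


def tokens(text: str) -> set[str]:
    # Lowercased word tokens; a real system would likely use embeddings instead.
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(task: str, skills: list[Skill], k: int = 2) -> list[Skill]:
    """Return up to k skills ranked by token overlap with the task description."""
    task_tokens = tokens(task)
    scored = [(len(task_tokens & tokens(s.title + " " + s.body)), s) for s in skills]
    # Keep only skills that share at least one token with the task.
    ranked = sorted((p for p in scored if p[0] > 0), key=lambda p: p[0], reverse=True)
    return [s for _, s in ranked[:k]]


def build_prompt(task: str, skills: list[Skill]) -> str:
    """Prepend retrieved skills so the agent sees them before its first action."""
    notes = "\n".join(f"- {s.title}: {s.body}" for s in retrieve(task, skills))
    return f"Known skills:\n{notes}\n\nTask: {task}"


# Illustrative store; the contents are invented for this sketch.
STORE = [
    Skill("GitLab merge request", "Open the project, use the merge requests tab, then filter by author."),
    Skill("Shopping order history", "Go to My Account, open My Orders, and sort by date."),
    Skill("Reddit post", "Pick the forum first, then use the submit button."),
]

print(build_prompt("Find the latest merge request by author alice on GitLab", STORE))
```

The point of the sketch is the ordering: skills are looked up and placed in the prompt before the agent takes any action, so the agent starts with a suggested path instead of discovering one by trial and error.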

Skills are organized into a graph rather than a flat list. New skills are merged into existing nodes instead of being appended, which prevents unbounded growth. That matters operationally: a skills store that expands linearly with every trace eventually becomes too expensive to retrieve from.
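A rough sketch of merge-instead-of-append follows. The `SkillGraph` and `SkillNode` classes, the Jaccard similarity on skill names, and the 0.5 threshold are all assumptions chosen to keep the example small; the paper's actual merge criterion is not described here.

```python
import re
from dataclasses import dataclass, field


def tokens(text: str) -> frozenset[str]:
    return frozenset(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard(a: frozenset[str], b: frozenset[str]) -> float:
    union = a | b
    return len(a & b) / len(union) if union else 0.0


@dataclass
class SkillNode:
    """Hypothetical graph node: one consolidated skill and its neighbours."""
    name: str
    notes: list[str] = field(default_factory=list)
    edges: set[str] = field(default_factory=set)


class SkillGraph:
    def __init__(self, merge_threshold: float = 0.5):
        self.nodes: dict[str, SkillNode] = {}
        self.merge_threshold = merge_threshold  # assumed tuning knob

    def add(self, name: str, note: str, related: tuple[str, ...] = ()) -> SkillNode:
        """Merge into the most similar existing node, or create a new one."""
        new_tokens = tokens(name)
        best = max(self.nodes.values(),
                   key=lambda n: jaccard(tokens(n.name), new_tokens),
                   default=None)
        if best is not None and jaccard(tokens(best.name), new_tokens) >= self.merge_threshold:
            node = best  # similar enough: fold the new skill into this node
        else:
            node = self.nodes.setdefault(name, SkillNode(name))
        if note not in node.notes:
            node.notes.append(note)
        # Link only to nodes that already exist, and never to itself.
        node.edges.update(r for r in related if r in self.nodes and r != node.name)
        return node


graph = SkillGraph()
graph.add("gitlab merge request", "Filter by author from the merge requests tab.")
graph.add("gitlab merge request review", "Approve from the Changes view.")
graph.add("shopping order history", "Sort My Orders by date.", related=("gitlab merge request",))
print(len(graph.nodes))  # 2: the second skill merged into the first node
```

With this design the node count grows with the number of distinct skills rather than the number of recorded traces, which is the property the paragraph above describes.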

Efficiency improves alongside accuracy. On WebArena-Hard, mean tool calls drop from 31.2 to 22.7, a 27% reduction, and median calls drop from 24 to 16, a reduction of one third. In nine live demos, BrowserBC reduced actions by 53% and tokens by 28% on average. For teams that measure browser agent cost in LLM API spend per workflow, the token reduction translates directly into lower cost.
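The reported figures can be checked with a few lines of arithmetic. The helper names below are made up for this example; the input numbers are the ones quoted in this article. Note the distinction between a relative reduction (percent) and an absolute change between two rates (percentage points).

```python
def pct_reduction(before: float, after: float) -> float:
    """Relative reduction from before to after, in percent."""
    return (before - after) / before * 100


def point_gain(before_pct: float, after_pct: float) -> float:
    """Absolute change between two percentages, in percentage points."""
    return after_pct - before_pct


print(round(pct_reduction(31.2, 22.7), 1))  # mean tool calls: 27.2
print(round(pct_reduction(24, 16), 1))      # median tool calls: 33.3
print(round(point_gain(60.5, 81.4), 1))     # task success: 20.9
```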

FIG. 02 WebArena-Hard: task success climbs 20.9 points while mean tool calls drop 27% with BrowserBC skills. — ai|expert chart from https://lab.einsia.ai/browserbc/

ClawBench shows larger gains. Across 152 tasks spanning daily, finance, work, developer, academic, travel, social, and pet-care domains, overall success jumps from 32.9% to 68.4%, a 35.5-point gain. Finance tasks go from 50% to 100%, and daily tasks climb from 24.6% to 64.9%. Back on WebArena-Hard, multi-site tasks move from 43.8% to 75%; these are the closest to real enterprise RPA, where agents coordinate across two or more web properties.
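As a sanity check, the overall rates can be converted back into whole task counts. This is a sketch under one assumption: that each reported rate is simply solved tasks divided by the benchmark's task total. The counts it produces are inferred from the percentages, not reported figures.

```python
def implied_count(success_pct: float, total_tasks: int) -> int:
    """Nearest whole number of solved tasks consistent with a reported rate."""
    return round(success_pct / 100 * total_tasks)


def rate(solved: int, total_tasks: int) -> float:
    """Success rate in percent, rounded to one decimal place."""
    return round(solved / total_tasks * 100, 1)


# (benchmark, task total, rate before, rate after), as quoted in this article.
REPORTED = [
    ("WebArena-Hard", 258, 60.5, 81.4),
    ("ClawBench", 152, 32.9, 68.4),
]

for name, total, before, after in REPORTED:
    solved_before = implied_count(before, total)
    solved_after = implied_count(after, total)
    # Each implied count should round back to the reported percentage.
    consistent = rate(solved_before, total) == before and rate(solved_after, total) == after
    print(name, solved_before, solved_after, consistent)
```

Both benchmarks round-trip cleanly, so the overall percentages are at least internally consistent with the stated task totals.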

FIG. 03 ClawBench success rates by category: Finance reaches 100%, Daily climbs 40 points, demonstrating broad task coverage. — ai|expert chart from https://lab.einsia.ai/browserbc/

Cross-model transfer matters for teams committed to a specific model. Skills distilled from Claude Sonnet traces move Qwen's success rate from 53% to 77% on the same task set, which suggests the skill graph is not tied to the model that produced it. A team running a cheaper model in production can import skills generated from a stronger one, paying for the stronger model once during skill creation rather than on every agent run.

Limitations: the paper doesn't address sites that change layout frequently, how skill graphs degrade when site-specific logic becomes stale, or whether aggressive consolidation introduces retrieval noise. The paper evaluates on static benchmark snapshots; production RPA systems deal with DOM churn that can invalidate skills within weeks.

If you're building a browser agent pipeline and manually annotating tasks to improve performance, BrowserBC's results suggest human session recordings are a cheaper, more scalable alternative.

Written and edited by AI agents · Methodology