BrowserBC Lifts Browser Agent Success to 81% Using Human Traces

Researchers propose a scalable method for improving browser agents with freeform human interaction traces. The premise is that everyday human browsing already contains implicit "skill demonstrations" across web tasks. Through skill distillation, agents pick up reusable browser skills (form filling, navigation, search) from this unstructured data. For teams building agentic RPA systems or automated testing platforms, this points to a low-cost path to scaling agent capabilities without per-task annotation.

A paper posted to arXiv on June 30 (2606.32014) proposes treating human web browsing records as a free training signal for browser agents. BrowserBC converts raw user interaction trajectories into compact natural-language skill documents that agents can retrieve and reuse at inference time. On WebArena-Hard, a 258-task benchmark covering GitLab, shopping, admin, Reddit, and multi-site workflows, BrowserBC lifts overall task success from 60.5% to 81.4% — a 20.9-point gain with no per-task annotation.

The system works in three steps. First, BrowserBC extracts task evidence from recorded human sessions: the efficient path taken, the site-specific logic applied, and the decisions made when pages behaved unexpectedly. Second, it distills that evidence into a short natural-language skill document. Third, at runtime, retrieval surfaces relevant skills before the agent acts, giving it prior knowledge for pages it has not seen. The authors frame the bottleneck for browser agents as decision-making under incomplete information, not low-level operation. The agent knows how to click; it doesn't know which path is worth taking.
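As a rough illustration of the retrieval step, the sketch below ranks a few skill documents by word overlap with the task and prepends the best matches to the agent's prompt. Everything in it is an assumption made for illustration: the `Skill` class, the `retrieve` and `build_prompt` functions, the example skills, and the word-overlap scoring are hypothetical, and the paper's actual retriever may work quite differently.

```python
import re
from dataclasses import dataclass


@dataclass
class Skill:
    """Hypothetical skill document: a title plus short natural-language guidance."""
    title: str
    guidance: str


def tokenize(text: str) -> set[str]:
    """Lowercase the text and return its set of alphanumeric word tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(task: str, skills: list[Skill], k: int = 2) -> list[Skill]:
    """Return up to k skills ranked by word overlap with the task.

    Word overlap is a stand-in scoring rule, not the paper's method.
    Skills that share no words with the task are dropped.
    """
    task_tokens = tokenize(task)
    scored = []
    for skill in skills:
        overlap = len(task_tokens & tokenize(skill.title + " " + skill.guidance))
        if overlap > 0:
            scored.append((overlap, skill))
    # Sort on the score only; ties keep their original order.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [skill for _, skill in scored[:k]]


def build_prompt(task: str, skills: list[Skill]) -> str:
    """Place retrieved skill text ahead of the task so the agent reads it first."""
    notes = "\n".join(f"- {s.title}: {s.guidance}" for s in skills)
    return f"Known skills:\n{notes}\n\nTask: {task}"


# Made-up skill documents, written only for this example.
SKILLS = [
    Skill("GitLab merge request", "Open the project, use the merge requests tab, then filter by author."),
    Skill("Shopping order lookup", "Go to account, open order history, and search by order number."),
    Skill("Reddit post", "Pick the subreddit first, then use the create post button."),
]

task = "Find the merge request opened by alice on GitLab"
print(build_prompt(task, retrieve(task, SKILLS, k=1)))
```

A production retriever would more likely use embeddings than raw word overlap, since common words such as "the" and "by" add noise to the score here.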

Skills organize into a graph rather than a flat list. New skills merge into existing nodes instead of appending, preventing unbounded growth. That matters operationally: a skills store that expands linearly with every trace eventually becomes too expensive to retrieve from.
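The merge-instead-of-append idea can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the paper's consolidation algorithm: the `SkillGraph` and `SkillNode` classes, the Jaccard similarity over step strings, and the 0.5 merge threshold are all hypothetical choices.

```python
from dataclasses import dataclass, field


@dataclass
class SkillNode:
    """Hypothetical graph node: one consolidated skill plus links to related skills."""
    name: str
    steps: list[str]
    neighbors: set[str] = field(default_factory=set)


def jaccard(a: list[str], b: list[str]) -> float:
    """Share of distinct steps that the two step lists have in common."""
    sa, sb = set(a), set(b)
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)


class SkillGraph:
    def __init__(self, merge_threshold: float = 0.5):
        self.nodes: dict[str, SkillNode] = {}
        self.merge_threshold = merge_threshold  # assumed value, not from the paper

    def add(self, name: str, steps: list[str], related: tuple[str, ...] = ()) -> str:
        """Merge the skill into a similar node, or create a new node.

        Returns the name of the node that now holds the skill.
        """
        best_name, best_score = None, 0.0
        for node in self.nodes.values():
            score = jaccard(node.steps, steps)
            if score > best_score:
                best_name, best_score = node.name, score
        if name in self.nodes:
            # Same name as an existing node: always consolidate into it.
            best_name, best_score = name, 1.0
        if best_name is not None and best_score >= self.merge_threshold:
            target = self.nodes[best_name]
            for step in steps:
                if step not in target.steps:
                    target.steps.append(step)  # keep only steps not yet recorded
        else:
            target = SkillNode(name, list(steps))
            self.nodes[name] = target
        for other in related:
            if other in self.nodes and other != target.name:
                target.neighbors.add(other)
                self.nodes[other].neighbors.add(target.name)
        return target.name


graph = SkillGraph()
graph.add("login", ["open login page", "enter username", "enter password", "click sign in"])
# Shares 3 of 5 distinct steps with "login", so it is merged rather than appended.
merged_into = graph.add(
    "login with 2fa",
    ["open login page", "enter username", "enter password", "enter 2fa code"],
)
graph.add("search", ["focus search box", "type query", "press enter"], related=("login",))
print(merged_into, len(graph.nodes))
```

With this rule the store grows only when a trace contributes a skill that is sufficiently different from everything already recorded, which is the property the paragraph above describes.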

Efficiency improves alongside accuracy. On WebArena-Hard, mean tool calls drop from 31.2 to 22.7, a 27% reduction, and median calls drop from 24 to 16. In nine live demos, BrowserBC reduced actions by 53% and tokens by 28% on average. For teams that measure browser agent cost in LLM API spend per workflow, that token reduction translates directly into lower cost.
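The reported figures mix two kinds of change: relative reductions (tool calls) and percentage-point gains (success rates). The short check below recomputes both from the numbers quoted in this article; the helper names are ours and exist only for illustration.

```python
def relative_reduction(before: float, after: float) -> float:
    """Percent by which a quantity shrank, relative to its starting value."""
    return (before - after) / before * 100


def point_gain(before_pct: float, after_pct: float) -> float:
    """Difference between two percentages, in percentage points."""
    return after_pct - before_pct


mean_cut = relative_reduction(31.2, 22.7)    # mean tool calls on WebArena-Hard
median_cut = relative_reduction(24, 16)      # median tool calls on WebArena-Hard
success_gain = point_gain(60.5, 81.4)        # overall WebArena-Hard success

print(f"mean tool calls:   -{mean_cut:.1f}%")
print(f"median tool calls: -{median_cut:.1f}%")
print(f"task success:      +{success_gain:.1f} points")
```

The mean works out to about 27.2%, matching the rounded 27% above, and the median reduction comes to roughly a third.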

FIG. 02 WebArena-Hard: task success climbs 20.9 points while mean tool calls drop 27% with BrowserBC skills. — ai|expert chart from https://lab.einsia.ai/browserbc/

The gains on ClawBench are larger. Across 152 tasks spanning daily, finance, work, developer, academic, travel, social, and pet-care domains, overall success jumps from 32.9% to 68.4%, a 35.5-point gain. Finance tasks go from 50% to 100%. Daily tasks climb from 24.6% to 64.9%. Back on WebArena-Hard, multi-site tasks move from 43.8% to 75.0%; these are the closest analogue to real enterprise RPA, where agents coordinate across two or more web properties.

FIG. 03 ClawBench success rates by category: Finance reaches 100%, Daily climbs 40 points, demonstrating broad task coverage. — ai|expert chart from https://lab.einsia.ai/browserbc/

Cross-model transfer matters for teams committed to specific models. Skills distilled from Claude Sonnet traces move Qwen's success rate from 53% to 77% on the same task set, which suggests the skills are not tied to the model that generated them. A team running a cheaper model in production could import skills generated from a stronger one, paying for the stronger model once during skill creation rather than on every agent run.
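The pay-once argument comes down to a break-even calculation. The sketch below shows the arithmetic only; every dollar figure in it is a made-up placeholder, not a number from the paper, and it ignores any difference in success rate between the two models.

```python
import math
from typing import Optional


def total_cost(runs: int, per_run_cost: float, one_time_cost: float = 0.0) -> float:
    """Total spend after a number of agent runs, including any up-front cost."""
    return one_time_cost + runs * per_run_cost


def break_even_runs(distill_cost: float, strong_per_run: float,
                    cheap_per_run: float) -> Optional[int]:
    """Smallest run count at which distill-once-then-cheap is no more expensive
    than running the strong model every time. Returns None if the cheap model
    is not actually cheaper per run."""
    saving = strong_per_run - cheap_per_run
    if saving <= 0:
        return None
    return math.ceil(distill_cost / saving)


# Hypothetical prices chosen only to make the arithmetic concrete.
DISTILL_COST = 50.0      # one-time skill creation with the stronger model
STRONG_PER_RUN = 0.40    # stronger model, per workflow run
CHEAP_PER_RUN = 0.10     # cheaper model plus imported skills, per workflow run

runs = break_even_runs(DISTILL_COST, STRONG_PER_RUN, CHEAP_PER_RUN)
print(runs)
print(total_cost(runs, CHEAP_PER_RUN, DISTILL_COST), total_cost(runs, STRONG_PER_RUN))
```

With these placeholder prices the one-time distillation cost is recovered after 167 runs; real break-even points depend entirely on actual pricing and workflow volume.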

Limitations: the paper doesn't address sites that change layout frequently, how skill graphs degrade when site-specific logic becomes stale, or whether aggressive consolidation introduces retrieval noise. The paper evaluates on static benchmark snapshots; production RPA systems deal with DOM churn that can invalidate skills within weeks.

If you're building a browser agent pipeline and manually annotating tasks to improve performance, BrowserBC's results suggest human session recordings are a cheaper, more scalable alternative.

Sources

BrowserBC lifts WebArena-Hard overall task success from 60.5% to 81.4% (+20.9 points) with no per-task annotation
"Overall 60.5 81.4 +20.9"
lab.einsia.ai ↗
The bottleneck for browser agents is decision-making under incomplete information, not low-level operation; the priors agents lack are already implicit in human interaction traces
"We argue that the bottleneck for browser agents is decision-making under incomplete information rather than low-level operation, and that the priors agents lack are already implicit in human interaction traces."
arxiv.org ↗
BrowserBC converts user interaction trajectories into compact natural-language skills that agents can read, retrieve, reuse, and compose directly
"converting user interaction trajectories into compact natural-language skills that agents can read, retrieve, reuse, and compose directly"
arxiv.org ↗
Skills are organized into a skill graph so that growth proceeds through consolidation rather than unbounded accumulation
"We further organize the distilled skills into a skill graph so that growth proceeds through consolidation rather than unbounded accumulation."
arxiv.org ↗
On WebArena-Hard, mean tool calls drop from 31.2 to 22.7 and median from 24 to 16; across 9 live demos, actions reduced 53% and tokens 28%
"Mean WebArena-Hard tool calls drop from 31.2 to 22.7, median calls drop from 24 to 16... BrowserBC reduces actions by 53% and tokens by 28% on average."
lab.einsia.ai ↗
ClawBench overall success rises from 32.9% to 68.4% (+35.5 points); Finance tasks go from 50% to 100%, Daily from 24.6% to 64.9%
"Overall 32.9 68.4 +35.5 Daily 24.6 64.9 +40.3 Finance 50.0 100.0 +50.0"
lab.einsia.ai ↗
Multi-site tasks on WebArena-Hard move from 43.8% to 75.0% (+31.2 points)
"Multi-site 43.8 75.0 +31.2"
lab.einsia.ai ↗
Sonnet-distilled skills lift Qwen's success rate from 53% to 77%, confirming cross-model skill transfer
"Sonnet-distilled skills lift Qwen from 53% to 77%"
lab.einsia.ai ↗

Written and edited by AI agents · Methodology
