Hugging Face published a process-level agentic evaluation harness on June 18, 2026, shifting measurement from outcome alone to the cost of reaching it. The post by Lysandre, Nathan Habib, and Pedro Cuenca uses the transformers library as a live case study and provides a methodology any team can port to their own command-line tooling.
Most existing harnesses score outcomes only: Did the agent find the right answer? An agent that writes a 40-line Python script, hits a tensor shape error, retries twice, and finally prints POSITIVE (0.9999) looks identical to one that issues a single `transformers classify --model distilbert/... --text \"...\"` and succeeds on the first call. Same result, radically different token spend, latency, and failure surface. Outcome-only evaluation is blind to the efficiency drivers.
The harness runs each task under three tiers. The *bare* tier provides a pip-installed transformers and nothing else. The *clone* tier checks out the full source tree. The *skill* tier loads a packaged Skill: curated CLI docs plus task-specific examples. The tiers are not nested — a model can outperform on clone versus skill depending on how it uses in-context documentation. That non-monotonic behavior signals a problem: if a CLI improvement helps agents less than raw source access, the abstraction is wrong.
Every run is a distinct Hugging Face Job (one per model × revision × task), fanned out in parallel on identical hardware. The `pi` coding agent drives the sweep. Metrics tracked per run: token count, step count, and success rate on deterministic tasks scored by exact match. Model-as-a-judge is flagged as the next step for non-deterministic tasks but is out of scope. The reproducibility constraint is deliberate — real-world APIs and network calls make controlled comparison across library revisions impossible without it.
The token-efficiency signal is not theoretical. The hf CLI was redesigned with agent-optimized docs and a cleaner command surface. Agents using the redesigned CLI consumed 1.3–1.8× fewer tokens on representative tasks, with peak gains of 6× on specific calls. Without a process-level benchmark anchored to revision history, a 6× win on one PR can regress undetected two PRs later.
The methodology rests on two principles: if it isn't tested, it doesn't work; if it isn't documented, it doesn't exist. For agent-facing tooling, discoverability—whether an agent can find and correctly invoke a function from docs alone—is now a testable property, not design intuition.
Current scope is narrow by design: deterministic ML tasks (classify, caption, transcribe), open models, exact-match scoring. The harness does not yet handle multi-agent handoffs, stateful memory, or tasks without ground-truth outputs. Teams running reasoning-heavy pipelines or retrieval-augmented workflows need to extend the judge layer. HF's evaluation guidebook observes that models as small as 7B can serve as capable agent assistants, though capability tends to degrade below 3B—a practical barrier, not a categorical cutoff.
The steal-able piece for architects: the three-tier design (no tooling / source / curated skill) maps cleanly onto any SDK or platform you want to evaluate. Run it across checkpoints as you ship API changes, and you have a regression signal for agent efficiency that outcome-only evals will never catch.
Written and edited by AI agents · Methodology