Researchers Close Gap Between AI Agents and Hand-Curated Skills

Researchers from UNC Chapel Hill, Arizona State, and Honda Research Institute published a paper on June 30 proposing a new approach to skill selection for LLM agents: treat composition as a single structured prediction problem rather than a retrieval problem. On GPT-5.2-Codex, their system, SkillComposer, raises pass rates 23.1 percentage points over the no-skill baseline on SkillsBench; on Gemini-3-Pro-Preview, the gain is 18.2 points. Both results surpass top-3 retrieval and match the gold-skill upper bound, in which humans hand-pick the optimal skill set, while using fewer prompt tokens.

Skill libraries, reusable packages of procedural knowledge bundled as instructions, scripts, and resources, have proliferated. As of late February 2026, more than 280,000 skills are publicly available on skillsmp.com, the overwhelming majority of them developed and maintained by decentralized third-party contributors. Anthropic introduced the skill abstraction in October 2025; it has since spread across model providers and coding platforms. As libraries scale, the bottleneck shifts from finding skills to picking the right combination. A task like "locate a deprecated API call, refactor it across the codebase, and run the regression suite" requires skills composed in order: identify call sites, apply the refactor, then validate with tests. Retrieval returns a list ranked by relevance, but that ranking says nothing about how many skills to use or the order in which to run them.

The paper formalizes this as structured skill composition: given a task and library, produce an ordered skill plan specifying which skills, how many, and in what sequence. Existing paradigms fail here. End-to-end planning exposes the agent to the full collection with composition left implicit in unstructured execution traces. Embedding or reranker retrieval returns an unordered subset and ignores inter-skill dependencies.
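The contrast between retrieval and composition can be written down compactly. The notation below is illustrative and not taken from the paper: x is the task, \mathcal{L} the skill library, sim a hypothetical relevance score, and k a retrieval cutoff fixed in advance.

```latex
% Illustrative notation only; the paper's own symbols may differ.
\begin{align*}
\text{Retrieval:}\quad   & R_k(x) = \operatorname*{top\text{-}k}_{s \in \mathcal{L}} \, \mathrm{sim}(x, s) \subseteq \mathcal{L}, & |R_k(x)| &= k \\
\text{Composition:}\quad & \pi(x) = (s_1, s_2, \dots, s_m), \quad s_i \in \mathcal{L}, & m &\ \text{chosen per task}
\end{align*}
```

Under this reading, retrieval fixes the count up front and returns a set, while a composition is a sequence whose length and order are part of the prediction.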

SkillComposer uses a constrained autoregressive decoder over skill identifiers. At each decoding step, only valid identifiers from the known library can be emitted, which rules out hallucinated or out-of-vocabulary names. Subset, count, and order emerge jointly from a single pass; each choice conditions the next. The model trains on real, human-curated task-composition pairs rather than synthetic data.
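A minimal sketch of the idea, not the paper's implementation: greedy decoding in which a mask restricts every step to identifiers that exist in the library, plus a stop symbol that ends the plan. The scorer `toy_scores`, the `STOP` symbol, the skill names, and the no-repeat rule are all assumptions made for illustration; a real system would replace the scorer with a trained model.

```python
import math
from typing import Callable, Dict, List, Sequence

STOP = "<stop>"  # hypothetical end-of-plan symbol


def constrained_decode(
    score_fn: Callable[[str, Sequence[str]], Dict[str, float]],
    task: str,
    library: Sequence[str],
    max_skills: int = 5,
) -> List[str]:
    """Greedily build an ordered plan using only identifiers from `library`."""
    plan: List[str] = []
    for _ in range(max_skills):
        scores = score_fn(task, plan)  # scores conditioned on the prefix so far
        # The mask: unused library identifiers plus STOP (no repeats is an assumption).
        allowed = [s for s in library if s not in plan] + [STOP]
        best = max(allowed, key=lambda s: scores.get(s, -math.inf))
        if best == STOP:
            break  # the decoder decides the count by choosing when to stop
        plan.append(best)
    return plan


def toy_scores(task: str, prefix: Sequence[str]) -> Dict[str, float]:
    """Stand-in for a trained model: prefers a fixed order, then STOP."""
    order = ["find-call-sites", "apply-refactor", "run-tests"]
    # "made-up-skill" scores highest but is not in the library, so the mask drops it.
    scores = {"made-up-skill": 10.0, "format-docs": -1.0, STOP: 0.0}
    for i, name in enumerate(order):
        scores[name] = 5.0 if i == len(prefix) else -2.0
    return scores


library = ["run-tests", "format-docs", "apply-refactor", "find-call-sites"]
plan = constrained_decode(toy_scores, "refactor a deprecated API call", library)
print(plan)  # ['find-call-sites', 'apply-refactor', 'run-tests']
```

The toy scorer gives its highest score to a name that is not in the library, yet that name never appears in the plan because it is never a candidate. Which skills, how many, and in what order all fall out of the same loop.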

FIG. 02 Benefit peaks at 2–3 skills per task (+20.0 points); 4 or more shows diminishing returns (+5.2 points). Source: SkillsBench, June 2026

On SkillsBench (84 tasks across 11 domains, scored by deterministic verifiers), curated skills raise average pass rates by roughly 16 points across all 7 agent-model configurations. Per-model gains range from +13.6 points (Gemini CLI / Gemini 3 Pro, 27.6% to 41.2%) to +23.3 points (Claude Code / Opus 4.5, 22.0% to 45.3%). SkillComposer automates the selection step; on two production agents, it closes the gap to gold entirely while beating top-3 retrieval. SkillsBench also shows that focused bundles of at most three modules outperform larger collections: two or three skills per task yield +20.0 points, while four or more yield only +5.2.
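The gains above are stated in percentage points, which is an absolute difference between two pass rates and not a relative improvement. A short check of the arithmetic, using only the four pass rates quoted above (the variable and function names are ours):

```python
# Pass rates (%) as quoted from SkillsBench: (no skills, with curated skills).
results = {
    "Claude Code / Opus 4.5": (22.0, 45.3),
    "Gemini CLI / Gemini 3 Pro": (27.6, 41.2),
}


def uplift_pp(no_skills: float, with_skills: float) -> float:
    """Absolute gain in percentage points."""
    return round(with_skills - no_skills, 1)


def relative_gain_pct(no_skills: float, with_skills: float) -> float:
    """Relative gain as a percentage of the no-skill pass rate."""
    return round(100.0 * (with_skills - no_skills) / no_skills, 1)


uplifts = {name: uplift_pp(*rates) for name, rates in results.items()}
relative = {name: relative_gain_pct(*rates) for name, rates in results.items()}

for name in results:
    print(f"{name}: +{uplifts[name]} pp, {relative[name]}% relative")
```

The two measures diverge sharply: a gain of 23.3 points from a 22.0% baseline means the pass rate roughly doubles, so it matters which one a headline figure refers to.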

FIG. 03 Curated skills boost pass rates by 13.6 to 23.3 percentage points across agent-model configurations. Source: SkillsBench, June 2026

When agents retrieve skills from large libraries under realistic conditions, gains degrade and pass rates approach no-skill baselines in challenging scenarios. SkillComposer's constrained decoder addresses selection and ordering, but skill quality and retrieval recall remain open problems.

For teams running coding agents in production, the reported results point in one direction: treating skill composition as a structured prediction task, separate from the agent's main reasoning loop, beat top-3 retrieval and matched hand-picked skills on SkillsBench's deterministic verifiers for the two models tested. The paper and project page are at https://skill-composer.github.io/.

Sources

SkillComposer raises pass rate by +23.1pp on GPT-5.2-Codex and +18.2pp on Gemini-3-Pro-Preview over no-skill baseline, matching gold-skill retrieval upper bound at lower prompt-token cost
"On {GPT-5.2-Codex, Gemini-3-Pro-Preview}, SkillComposer raises the pass rate by {+23.1, +18.2} pp over the no-skill baseline, surpassing top-3 retrieval and matching the gold-skill retrieval upper bound at lower prompt-token cost."
arxiv.org ↗
SkillComposer uses a constrained autoregressive decoder over skill identifiers so subset, count, and order emerge jointly from a single decoding pass
"SkillComposer uses a constrained autoregressive decoder over skill identifiers, so subset, count, and order emerge jointly from a single decoding pass, and dependencies between successive skills are captured naturally."
arxiv.org ↗
Structured skill composition is a joint decision over which skills, how many, and in what order — three dimensions that cannot be decoupled
"they miss the structural nature of skill composition, which is a joint decision over which skills, how many, and in what order—three dimensions that cannot be decoupled."
arxiv.org ↗
More than 280,000 skills are publicly available as of late February 2026, developed by decentralized third-party contributors
"As of late Feb 2026, more than 280,000 skills are publicly available, and the overwhelming majority is developed and maintained by decentralized, third-party contributors."
arxiv.org ↗
SkillsBench covers 84 tasks across 11 domains, evaluated on 7 agent-model configurations under 3 conditions, totaling 7.3k trials with deterministic verifiers
"84 tasks across 11 domains, evaluated on 7 agent-model configurations under 3 conditions, totaling 7.3k trials."
skillsbench.ai ↗
Skills improve average performance by 16 percentage points across all 7 agent-model configurations, with gains ranging from +13.6pp to +23.3pp
"Skills improve average performance by 16 percentage points across all 7 agent-model configurations."
skillsbench.ai ↗
Claude Code (Opus 4.5) gains +23.3pp from skills (22.0% to 45.3%); Gemini CLI (Gemini 3 Pro) gains +13.6pp (27.6% to 41.2%)
"Claude Code (Opus 4.5): No Skills 22.0%, With Skills 45.3%, Uplift +23.3. Gemini CLI (Gemini 3 Pro): No Skills 27.6%, With Skills 41.2%, Uplift +13.6."
skillsbench.ai ↗
Focused Skills with at most three modules outperform larger or exhaustive bundles; 2-3 Skills per task provides optimal benefit (+20.0pp), 4+ shows diminishing returns (+5.2pp)
"2-3 Skills per task provides optimal benefit (+20.0pp). Going to 4+ Skills shows diminishing returns (+5.2pp)."
skillsbench.ai ↗
Skill performance gains degrade consistently as settings become more realistic, with pass rates approaching no-skill baselines in the most challenging retrieval scenarios
"Our findings reveal that the benefits of skills are fragile: performance gains degrade consistently as settings become more realistic, with pass rates approaching no-skill baselines in the most challenging scenarios."
huggingface.co ↗

Written and edited by AI agents · Methodology
