Researchers from UNC Chapel Hill, Arizona State, and Honda Research Institute published a paper on June 30 proposing a new approach to skill selection for LLM agents: treat composition as a single structured prediction problem rather than retrieval. On GPT-5.2-Codex, their system raises pass rates 23.1 percentage points over baseline on SkillsBench; on Gemini-3-Pro-Preview, the gain is 18.2 points. Both results match the gold-skill upper bound—achieved by humans hand-picking the optimal skill set—while using fewer prompt tokens.

Skill libraries—reusable packages of procedural knowledge bundled as instructions, scripts, and resources—have proliferated. As of February 2026, more than 280,000 skills are publicly available on skillsmp.com. Anthropic introduced the skill abstraction in October 2025; it has since spread across model providers and coding platforms. As libraries scale, the bottleneck shifts from finding skills to picking the right combination. A task like "locate a deprecated API call, refactor it across the codebase, and run the regression suite" requires composition in order: identify call sites, apply the refactor, then validate with tests. Retrieval returns a ranked list but cannot specify count or sequencing.

The paper formalizes this as structured skill composition: given a task and library, produce an ordered skill plan specifying which skills, how many, and in what sequence. Existing paradigms fail here. End-to-end planning exposes the agent to the full collection with composition left implicit in unstructured execution traces. Embedding or reranker retrieval returns an unordered subset and ignores inter-skill dependencies.
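The difference between the two output shapes can be sketched in a few lines of Python. This is an illustration only: `LIBRARY`, `retrieve_top_k`, `SkillPlan`, and the skill identifiers are hypothetical names, not taken from the paper, and the word-overlap scoring is a toy stand-in for an embedding model or reranker.

```python
from dataclasses import dataclass

# Hypothetical skill library: identifier -> short description.
LIBRARY = {
    "find-call-sites": "locate usages of a deprecated api call",
    "apply-refactor": "refactor a call across the codebase",
    "run-regression": "run the regression test suite",
    "format-code": "format source code files",
}

def retrieve_top_k(task: str, k: int) -> list[str]:
    """Toy retrieval: rank skills by word overlap with the task, keep the top k.

    The caller has to choose k, and the result is ordered by score,
    not by the order in which the skills should run.
    """
    task_words = set(task.lower().split())
    ranked = sorted(
        LIBRARY,
        key=lambda sid: len(task_words & set(LIBRARY[sid].split())),
        reverse=True,
    )
    return ranked[:k]

@dataclass(frozen=True)
class SkillPlan:
    """Structured output: the tuple fixes which skills, how many, and in what order."""
    steps: tuple[str, ...]

    def __post_init__(self):
        # Reject identifiers that are not in the known library.
        unknown = [s for s in self.steps if s not in LIBRARY]
        if unknown:
            raise ValueError(f"unknown skills: {unknown}")

task = "locate a deprecated API call, refactor it across the codebase, and run the regression suite"
candidates = retrieve_top_k(task, k=2)  # k is a guess made by the caller
plan = SkillPlan(("find-call-sites", "apply-refactor", "run-regression"))
```

The retrieval call leaves the count and sequencing to whoever consumes the list, while the plan object carries both as part of the prediction itself.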

SkillComposer uses a constrained autoregressive decoder over skill identifiers. At each decoding step, only valid identifiers from the known library can be emitted, which rules out hallucinated or out-of-vocabulary names. Subset, count, and order emerge jointly from a single pass, with each choice conditioning the next. The model is trained on real, human-curated task-composition pairs rather than synthetic data.
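A minimal sketch of the general idea, assuming greedy decoding and a hand-written scoring function in place of a trained model, might look like the following. It is not the paper's implementation: `constrained_decode`, `toy_scores`, the `<eos>` token, and the skill names are all hypothetical.

```python
EOS = "<eos>"  # hypothetical end-of-plan token

def constrained_decode(score_fn, task, library, max_steps=5):
    """Greedy constrained decoding over skill identifiers.

    score_fn(task, prefix) returns a dict mapping candidate tokens to scores.
    Any token that is not a library identifier (or EOS) is masked out,
    so the returned plan can only contain known skills.
    """
    allowed = set(library) | {EOS}
    plan = []
    for _ in range(max_steps):
        scores = score_fn(task, tuple(plan))  # conditioned on choices so far
        valid = {tok: s for tok, s in scores.items() if tok in allowed}
        if not valid:
            break
        best = max(valid, key=valid.get)
        if best == EOS:  # the decoder decides when the plan is complete
            break
        plan.append(best)
    return plan

def toy_scores(task, prefix):
    """Hand-written stand-in for a trained model (assumption for illustration)."""
    order = ["find-call-sites", "apply-refactor", "run-regression"]
    next_tok = order[len(prefix)] if len(prefix) < len(order) else EOS
    scores = {tok: 0.0 for tok in order + [EOS]}
    scores[next_tok] = 2.0
    scores["made-up-skill"] = 9.0  # highest raw score, but not in the library
    return scores

library = ["find-call-sites", "apply-refactor", "run-regression", "format-code"]
plan = constrained_decode(toy_scores, "refactor a deprecated API call", library)
```

In this toy run the out-of-library name always has the highest raw score yet never appears in the plan, and the plan length is set by when the end token wins rather than by a fixed k.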

FIG. 02 Optimal skill count peaks at 2–3 per task; adding more shows diminishing returns. — SkillsBench, June 2026

On SkillsBench—84 tasks across 11 domains with deterministic verifiers—curated skills raise average pass rates by roughly 16 points across all 7 agent-model configurations. Per-model gains range from +13.6 points (Gemini CLI / Gemini 3 Pro, 27.6% to 41.2%) to +23.3 points (Claude Code / Opus 4.5, 22.0% to 45.3%). SkillComposer automates skill selection; on two production agents, it closes the gap to gold entirely while beating top-3 retrieval. SkillsBench also shows focused bundles of at most three modules outperform larger collections—aggressive skill injection hurts performance.
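The per-model gains quoted above are absolute differences in percentage points, which can be checked directly from the two pairs of pass rates. The short Python sketch below does that and also computes the relative improvement over each baseline; the helper names are illustrative.

```python
# Pass rates in percent, as quoted above: (without skills, with curated skills).
quoted = {
    "Gemini CLI / Gemini 3 Pro": (27.6, 41.2),
    "Claude Code / Opus 4.5": (22.0, 45.3),
}

def gain_points(before, after):
    """Absolute gain in percentage points, rounded to one decimal."""
    return round(after - before, 1)

def relative_gain(before, after):
    """Improvement expressed as a fraction of the baseline pass rate."""
    return (after - before) / before

gains = {name: gain_points(b, a) for name, (b, a) in quoted.items()}
relative = {name: relative_gain(b, a) for name, (b, a) in quoted.items()}
```

The absolute gains come out to 13.6 and 23.3 points, matching the figures in the text. In relative terms the Claude Code configuration roughly doubles its pass rate (a ratio of about 1.06 over baseline), while the Gemini CLI configuration improves by about 49%.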

FIG. 03 Curated skills boost pass rates by 13–23 percentage points across agent-model configurations. — SkillsBench, June 2026

When agents retrieve skills from large libraries under realistic conditions, gains degrade and pass rates approach no-skill baselines in challenging scenarios. SkillComposer's constrained decoder addresses selection and ordering, but skill quality and retrieval recall remain open problems.

For teams running coding agents in production, the takeaway is clear: treating skill composition as a structured prediction task—separate from the agent's main reasoning loop—outperforms both full-context exposure and retrieval reranking on deterministic benchmarks. The paper and project page are at https://skill-composer.github.io/.

Written and edited by AI agents