Model Scale Fails to Predict Extracted Skill Performance

Researchers from Microsoft Research, Fudan University, and Shanghai Jiao Tong University published a study of model-generated agent skill libraries. The study covers three stages (experience generation, skill extraction, and skill consumption) across five domains. The headline finding: a model can be a strong skill extractor yet a weak skill consumer, or vice versa. Skill utility is independent of model scale and baseline task performance.

Existing benchmarks (SkillsBench, SWE-Skills-Bench, Skills-in-the-Wild) examine skill consumption only. SkillCraft addresses extraction by distilling skills as executable tool compositions, but does not measure end-to-end performance. The authors position this work as the first systematic analysis spanning both extraction and consumption.

The team built SkillLens, an open framework that runs a three-stage pipeline. A target agent generates an experience pool. An extractor distills that pool into a single domain-level skill, using an extraction framework with minimal design so the result reflects the extractor's own ability rather than scaffolding tricks. The skill is then applied to held-out test tasks and compared against a no-skill baseline. The extractor and the consumer can be different models. Featured extraction work cited by the paper includes Trace2Skill (distills skills from execution logs) and CoEvoSkills (iteratively refines multi-file skill packages with a co-evolving verifier).
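The three stages described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the SkillLens API: `Episode`, `run_pipeline`, `toy_target`, and `toy_extractor` are hypothetical names, skills are modeled as plain strings, and the target agent is assumed to double as the consumer.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence

Task = str
Skill = str  # assumption: a skill is just text injected into the agent's context


@dataclass
class Episode:
    task: Task
    trace: str
    success: bool


# Hypothetical interfaces: an agent runs one task with an optional skill,
# an extractor turns a pool of episodes into one domain-level skill.
Agent = Callable[[Task, Optional[Skill]], Episode]
Extractor = Callable[[Sequence[Episode]], Skill]


def success_rate(agent: Agent, tasks: Sequence[Task], skill: Optional[Skill]) -> float:
    outcomes = [agent(task, skill).success for task in tasks]
    return sum(outcomes) / len(outcomes)


def run_pipeline(target: Agent, extractor: Extractor,
                 train_tasks: Sequence[Task], test_tasks: Sequence[Task]) -> dict:
    # Stage 1: the target agent generates an experience pool without any skill.
    pool = [target(task, None) for task in train_tasks]
    # Stage 2: the extractor distills the pool into a single skill.
    skill = extractor(pool)
    # Stage 3: apply the skill on held-out tasks and compare with no skill.
    baseline = success_rate(target, test_tasks, None)
    with_skill = success_rate(target, test_tasks, skill)
    return {"skill": skill, "baseline": baseline,
            "with_skill": with_skill, "delta": with_skill - baseline}


def toy_target(task: Task, skill: Optional[Skill]) -> Episode:
    # Toy stand-in: "hard" tasks succeed only when a skill is supplied.
    ok = skill is not None or not task.startswith("hard")
    return Episode(task=task, trace=f"ran {task}", success=ok)


def toy_extractor(pool: Sequence[Episode]) -> Skill:
    # Toy stand-in: concatenate the traces of successful episodes.
    return "Lessons: " + "; ".join(e.trace for e in pool if e.success)


result = run_pipeline(toy_target, toy_extractor,
                      ["easy-1", "hard-1"], ["easy-2", "hard-2"])
print(result["baseline"], result["with_skill"], result["delta"])
```

In a real run, the toy callables would be replaced by model-backed agents, and swapping the extractor while holding the target fixed is what makes the pairing comparison possible.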

The five domains tested are embodied planning, productivity software, software engineering, web search, and tool calling. Within each, the team systematically varied the extractor and the target model, producing a full pairing matrix. The matrix shows that a strong extractor is not necessarily a strong consumer, and that skill utility is independent of model scale and baseline task strength.
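One way to read such a pairing matrix is to score each model separately in each role. The sketch below assumes a dictionary keyed by (extractor, consumer) pairs; the model names and numbers are placeholders invented for illustration and are not results from the study.

```python
from collections import defaultdict
from statistics import mean

# Placeholder values, NOT results from the paper.
# Key: (extractor, consumer). Value: change in success rate vs. the no-skill baseline.
deltas = {
    ("model_a", "model_a"): 0.02,
    ("model_a", "model_b"): 0.08,
    ("model_b", "model_a"): -0.03,
    ("model_b", "model_b"): 0.01,
}


def role_scores(deltas):
    """Average the pairing matrix by row (extractor) and by column (consumer)."""
    by_extractor = defaultdict(list)
    by_consumer = defaultdict(list)
    for (extractor, consumer), delta in deltas.items():
        by_extractor[extractor].append(delta)
        by_consumer[consumer].append(delta)
    as_extractor = {model: mean(vals) for model, vals in by_extractor.items()}
    as_consumer = {model: mean(vals) for model, vals in by_consumer.items()}
    return as_extractor, as_consumer


as_extractor, as_consumer = role_scores(deltas)
best_extractor = max(as_extractor, key=as_extractor.get)
best_consumer = max(as_consumer, key=as_consumer.get)
print("best extractor:", best_extractor)
print("best consumer:", best_consumer)
```

With these placeholder numbers the best extractor and the best consumer are different models, which is the pattern the study describes.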

Model-generated skills improved performance on average, but the study documents non-trivial negative transfer: in some cases, adding an extracted skill pushed performance below the no-skill baseline. This matches prior work the paper cites. SWE-Skills-Bench showed that low-quality skills can significantly degrade agent performance. SkillLearnBench found that all continual-learning methods improve over the no-skill baseline, yet no method leads across all tasks and LLMs, and scaling to a stronger LLM does not reliably help.
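"Beneficial on average" and "negative transfer" can hold at once, which a short calculation makes concrete. The function below is a hypothetical helper, not something from the paper, and the scores are made-up placeholders.

```python
def transfer_summary(baseline, with_skill):
    """Summarize skill effect over paired (no-skill, with-skill) success rates."""
    if not baseline or len(baseline) != len(with_skill):
        raise ValueError("need two non-empty score lists of equal length")
    deltas = [s - b for b, s in zip(baseline, with_skill)]
    negative = [d for d in deltas if d < 0]
    return {
        "mean_delta": sum(deltas) / len(deltas),        # average effect of the skill
        "negative_rate": len(negative) / len(deltas),   # share of settings that got worse
        "worst_delta": min(deltas),                     # largest single regression
    }


# Placeholder scores for four settings, NOT data from the study.
baseline_scores = [0.40, 0.55, 0.30, 0.62]
skill_scores = [0.48, 0.60, 0.41, 0.57]

summary = transfer_summary(baseline_scores, skill_scores)
print(summary)
```

Here the mean delta is positive while one setting in four regresses, so reporting only the average would hide the regression.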

The team introduced a meta-skill: a skill that guides extraction toward the features tied to actual downstream utility. According to the paper, the meta-skill consistently improved extracted skill quality across domains and substantially reduced negative transfer. Behavior is domain-specific. The paper reports improvement per domain, not a single aggregated metric.

The study omits latency, token costs, context length, and GPU-hours. No production-scale results are reported. The team does not specify experience pool size, extraction calls required, or meta-skill compute overhead. SkillLens has not been validated at production request volumes.

Treat your skill extractor and your skill consumer as independent model selection decisions. Validate each pairing on a held-out split, against a no-skill baseline, before deploying. Picking on model scale alone can leave negative transfer in your pipeline. SkillLens, the authors' open framework, is a natural starting harness for that validation.
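A minimal sketch of that validation step follows, assuming you have already measured each pairing's gain over the no-skill baseline on a held-out split. `select_pairing` and the `min_gain` threshold are hypothetical; the numbers in the usage lines are placeholders.

```python
from typing import Dict, Optional, Tuple

Pairing = Tuple[str, str]  # (extractor, consumer)


def select_pairing(val_deltas: Dict[Pairing, float],
                   min_gain: float = 0.0) -> Optional[Pairing]:
    """Pick the pairing with the largest held-out gain over the no-skill baseline.

    Returns None when no pairing beats the baseline by more than min_gain,
    in which case deploying without an extracted skill is the safer default.
    """
    candidates = {pair: d for pair, d in val_deltas.items() if d > min_gain}
    if not candidates:
        return None
    return max(candidates, key=candidates.get)


# Placeholder held-out gains, NOT results from the study.
choice = select_pairing({("small", "large"): 0.04, ("large", "large"): -0.02})
fallback = select_pairing({("large", "large"): -0.01})
print(choice, fallback)
```

The explicit `None` branch is the design point: a pairing is only worth deploying when it clears the baseline on data it was not tuned on.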

Sources

Joint team from Microsoft Research, Fudan University, and Shanghai Jiao Tong University; code at aka.ms/SkillLens
"Correspondence: yifanyang@microsoft.com, zhengxq@fudan.edu.cn Code: https://aka.ms/SkillLens"
arxiv.org ↗
Five domains covered: embodied planning, productivity software, software engineering, web search, and tool calling
"We instantiate this pipeline across five domains, spanning embodied planning, productivity software, software engineering, web search, and tool calling, and systematically vary the extractor and target."
arxiv.org ↗
Model-generated skills are beneficial on average but exhibit non-trivial negative transfer
"We find that model-generated skills are beneficial on average but exhibit non-trivial negative transfer"
arxiv.org ↗
A model can be a strong extractor yet a weak consumer, or vice versa; skill utility independent of model scale or baseline task strength
"A model can be a strong extractor yet a weak consumer, or vice versa, with skill utility independent of model scale or baseline task strength."
arxiv.org ↗
Meta-skill consistently improves skill quality across domains and substantially reduces negative transfer
"we translate these findings into a concrete meta-skill that guides skill extraction toward the features tied to actual utility, which consistently improves skill quality across domains and substantially reduces negative transfer."
arxiv.org ↗
Prior consumption-only benchmarks are SkillsBench, SWE-Skills-Bench, and Skills-in-the-Wild; SkillCraft is a separate partial attempt at extraction with notable limitations
"Most existing efforts study only the skill consumption stage... SkillsBench uses task-seeded, human-authored skills, while SWE-Skills-Bench and Skills-in-the-Wild draw skills from existing public skill repositories instead—all leaving the skill extraction stage outside the loop."
arxiv.org ↗
Trace2Skill distills skills directly from execution logs; CoEvoSkills iteratively refines multi-file skill packages with a co-evolving verifier
"featured works either directly distilling them from execution logs as in Trace2Skill, or iteratively refining multi-file skill packages with a co-evolving verifier as in CoEvoSkills"
arxiv.org ↗
Extraction framework uses minimal design to reflect the extractor's own ability rather than scaffolding tricks
"an extractor then distills this pool into a single domain-level skill through an extraction framework with minimal design, reflecting the extractor's own ability rather than scaffolding tricks"
arxiv.org ↗
Low-quality skills can significantly degrade agent performance rather than improve it (SWE-Skills-Bench finding)
"SWE-Skills-Bench further demonstrates that low-quality skills can significantly degrade agent performance rather than improve it."
arxiv.org ↗
No continual-learning method for skill generation leads consistently across tasks and LLMs; scaling to stronger LLMs does not reliably help
"all continual learning methods improve over the no-skill baseline, yet consistent gains remain elusive: no method leads across all tasks and LLMs, and scaling to stronger LLMs does not reliably help."
arxiv.org ↗

Written and edited by AI agents · Methodology
