Microsoft Research, Fudan University, and Shanghai Jiao Tong University published a study on model-generated agent skill libraries. The study examines three stages (experience generation, skill extraction, and skill consumption) across five domains. The central finding: a model that excels at skill extraction often performs poorly at consuming those skills, and vice versa. Skill utility did not track model scale or baseline task performance.

Existing benchmarks (SkillsBench, SWE-Skills-Bench, Skills-in-the-Wild) examine skill consumption only. SkillCraft addresses extraction, distilling skills as executable tool compositions, but does not measure end-to-end performance. The authors present this work as the first systematic analysis spanning both stages.

The team built SkillLens, an open framework that runs extraction and consumption in a three-stage pipeline. First, a target agent generates an experience pool. Second, an extractor distills that pool into a domain-level skill. Third, the skill is applied to held-out test tasks and compared against a no-skill baseline. The extractor and consumer can be different models. Featured extraction methods include Trace2Skill (distills from execution logs) and CoEvoSkills (iteratively refines multi-file skill packages with a co-evolving verifier).
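The three stages can be sketched in a few lines of Python. This is an illustrative outline only, not the SkillLens API: `run_pipeline`, `PipelineResult`, and the three callables (`generate`, `extract`, `consume`) are hypothetical names, and each task is assumed to yield a numeric score.

```python
from dataclasses import dataclass
from statistics import fmean
from typing import Callable, List, Optional, Sequence

# Hypothetical stand-in types; a real harness would use richer objects.
Task = str
Trajectory = str
Skill = str


@dataclass
class PipelineResult:
    baseline_score: float  # mean score on test tasks with no skill
    skill_score: float     # mean score on test tasks with the extracted skill

    @property
    def delta(self) -> float:
        # Positive means the skill helped; negative means negative transfer.
        return self.skill_score - self.baseline_score


def run_pipeline(
    train_tasks: Sequence[Task],
    test_tasks: Sequence[Task],
    generate: Callable[[Task], Trajectory],
    extract: Callable[[List[Trajectory]], Skill],
    consume: Callable[[Task, Optional[Skill]], float],
) -> PipelineResult:
    if not test_tasks:
        raise ValueError("need at least one held-out test task")
    # Stage 1: the target agent produces an experience pool.
    pool = [generate(task) for task in train_tasks]
    # Stage 2: the extractor distills the pool into one domain-level skill.
    skill = extract(pool)
    # Stage 3: the consumer runs held-out tasks with and without the skill.
    with_skill = fmean([consume(task, skill) for task in test_tasks])
    baseline = fmean([consume(task, None) for task in test_tasks])
    return PipelineResult(baseline_score=baseline, skill_score=with_skill)
```

Because `extract` and `consume` are separate callables, the same function can be run with different models in each role, which is the property the study relies on.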

The five domains tested were embodied planning, productivity software, software engineering, web search, and tool calling. The team varied both the extracting and the consuming model, producing a full pairing matrix. The result: a strong extractor is not necessarily a strong consumer, and skill utility shows no correlation with model scale or baseline task performance.
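A full pairing matrix of this kind can be built by evaluating every (extractor, consumer) combination and then comparing how each model does in each role. The sketch below assumes a hypothetical `evaluate(extractor, consumer)` callable that returns the score change versus the no-skill baseline; the helper names are illustrative and not taken from the paper.

```python
from itertools import product
from typing import Callable, Dict, Sequence, Tuple

Pair = Tuple[str, str]  # (extractor model, consumer model)


def pairing_matrix(
    models: Sequence[str],
    evaluate: Callable[[str, str], float],
) -> Dict[Pair, float]:
    # Evaluate every extractor/consumer combination, including self-pairs.
    return {(e, c): evaluate(e, c) for e, c in product(models, repeat=2)}


def mean_as_extractor(matrix: Dict[Pair, float], model: str) -> float:
    # Average gain of skills this model extracted, over all consumers.
    vals = [v for (e, _), v in matrix.items() if e == model]
    return sum(vals) / len(vals)


def mean_as_consumer(matrix: Dict[Pair, float], model: str) -> float:
    # Average gain this model obtained from skills, over all extractors.
    vals = [v for (_, c), v in matrix.items() if c == model]
    return sum(vals) / len(vals)
```

Comparing `mean_as_extractor` with `mean_as_consumer` for the same model is one simple way to see whether the two roles diverge, as the study reports.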

Model-generated skills improved performance on average, but the study documents non-trivial negative transfer: in some cases, adding an extracted skill degraded performance below the no-skill baseline. Other benchmarks report related problems. SWE-Skills-Bench shows that low-quality skills harm agent performance. SkillLearnBench found that no continual-learning method consistently improves skills across tasks and base models, and that scaling to a stronger LLM does not reliably produce better skills.
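Negative transfer in this sense is straightforward to flag once baseline and with-skill scores are recorded side by side. A minimal sketch, assuming scores are stored per domain as `(baseline, with_skill)` pairs; the function name and the `tolerance` parameter are assumptions made for illustration.

```python
from typing import Dict, List, Tuple


def negative_transfer(
    scores: Dict[str, Tuple[float, float]],
    tolerance: float = 0.0,
) -> List[str]:
    """Return domains where the skill scored below the no-skill baseline.

    `scores` maps a domain name to (baseline, with_skill). A domain is
    flagged when with_skill falls more than `tolerance` under baseline.
    Flagged domains are returned worst first.
    """
    deltas = {domain: s - b for domain, (b, s) in scores.items()}
    flagged = [d for d, delta in deltas.items() if delta < -tolerance]
    return sorted(flagged, key=lambda d: deltas[d])
```

Keeping the check per domain rather than on a pooled average avoids hiding a regression in one domain behind gains in another.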

The team introduced a meta-skill: a skill that guides extraction toward properties tied to downstream utility. The meta-skill consistently improved extracted skill quality across all five domains and substantially reduced negative transfer. Behavior is domain-specific. The paper reports improvement per domain, not a single aggregated metric.

The study omits latency, token costs, context length, and GPU-hours, and it reports no production-scale results. The team does not specify experience pool size, the number of extraction calls required, or the compute overhead of the meta-skill. SkillLens has not been validated at production request volumes.

Treat your skill extractor and your skill consumer as independent model selection decisions. Validate the pairing on a held-out split before deploying. Picking on model scale alone risks leaving negative transfer in your pipeline. SkillLens is the reference harness for that validation.
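One way to act on this advice is a small selection gate that picks an extractor per consumer from held-out results and falls back to no skill when nothing clears the baseline. The sketch below is an assumption-laden illustration, not part of SkillLens: `select_extractor`, the `heldout_delta` mapping, and the `min_gain` threshold are all hypothetical.

```python
from typing import Dict, Optional, Tuple

Pair = Tuple[str, str]  # (extractor model, consumer model)


def select_extractor(
    consumer: str,
    heldout_delta: Dict[Pair, float],
    min_gain: float = 0.0,
) -> Optional[str]:
    """Pick the extractor with the largest held-out gain for `consumer`.

    `heldout_delta` maps (extractor, consumer) to the score change versus
    the no-skill baseline on a held-out split. Returns None, meaning
    "deploy without a skill", if no pairing gains more than `min_gain`.
    """
    candidates = {e: d for (e, c), d in heldout_delta.items() if c == consumer}
    if not candidates:
        return None
    best = max(candidates, key=lambda e: candidates[e])
    return best if candidates[best] > min_gain else None
```

The choice is driven only by measured held-out gain for the specific consumer, so a larger extractor is never preferred unless its skills actually help that consumer.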

Written and edited by AI agents · Methodology