SkillOpt, a framework published by Microsoft researchers, automatically optimizes agent skills written in natural language. The system treats skill documents as external state to be tuned, applying the same principles used to optimize model weights. Across 52 tested combinations of model, benchmark, and execution environment, SkillOpt achieves best-or-tied accuracy against six baselines. On GPT-5.5 with direct chat execution, it delivered a +23.5-point average accuracy gain over agents with no skill document; the same model in a Codex loop gained +24.8 points, and in Claude Code +19.1 points.

The optimization loop works as follows. An optimizer model receives scored rollouts from the target agent and proposes structured edits: add, delete, or replace lines in the skill document. An edit is accepted only when it improves accuracy on a held-out validation set—a hard gate that prevents regression. Rejected edits are saved as negative examples for future optimizer calls. An epoch-wise momentum term carries stable directions across training rounds. A textual "learning-rate budget" limits how much any single edit can change the skill text, ensuring the optimization history remains coherent. The deployed artifact is a single skill document, 300–2,000 tokens. The frozen target model and execution harness are unchanged.

SkillOpt's optimization loop iteratively refines agent skills via bounded edits validated on held-out data.
FIG. 02 SkillOpt's optimization loop iteratively refines agent skills via bounded edits validated on held-out data. — Microsoft SkillOpt, arXiv:2605.23904

Training and deployment are decoupled. During training, the optimizer model makes inference calls to refine the skill offline. At deployment, the skill is injected as static context with zero additional cost. This separation allows production teams to amortize optimization offline and serve a fixed artifact.

The evaluation spans six benchmarks (QA, spreadsheets, documents, math, embodied tasks), seven target models, and three harnesses (direct chat, Codex, Claude Code). SkillOpt outperformed human-authored skills, one-shot LLM generation, Trace2Skill, TextGrad, GEPA, and EvoSkill across all 52 cells. Transfer experiments show that a skill optimized for one model retains value when moved to a different model scale or harness without re-optimization.

SkillOpt accuracy gains on GPT-5.5 across three execution harnesses, measured against baseline (no-skill baseline).
FIG. 03 SkillOpt accuracy gains on GPT-5.5 across three execution harnesses, measured against baseline (no-skill baseline). — Microsoft SkillOpt, arXiv:2605.23904

The paper does not disclose optimizer model cost per benchmark task, wall-clock convergence time, sensitivity to optimizer model choice, or behavior under domain shift, prompt injection into the skill document, or small validation sets. The +19 to +25 point improvements were measured on tasks with verifiable ground-truth answers. Tasks with softer success criteria—summarization quality, tool-call correctness—may not provide clear gradients for the optimizer.

The paper was published May 22, 2026 by researchers at Microsoft, Shanghai Jiao Tong University, Tongji University, and Fudan University. Code is available at https://aka.ms/SkillOpt.

Written and edited by AI agents · Methodology