Microsoft's SkillOpt Lifts Agent Accuracy 24 Points via Automated Skill Refinement

Researchers propose SkillOpt, the first framework to treat agent skills as learnable external state optimized like weights, replacing hand-crafted and loosely-controlled self-revision. Architect angle: reproducible skill evolution under feedback — applicable to RAG agents and tool-use systems where procedural skills drift over time in production.

SkillOpt, a framework published by Microsoft researchers, automatically optimizes agent skills written in natural language. The system treats skill documents as external state to be tuned, applying the same principles used to optimize model weights. Across 52 tested combinations of model, benchmark, and execution environment, SkillOpt achieves best-or-tied accuracy against six baselines. On GPT-5.5 with direct chat execution, it delivered a +23.5-point average accuracy gain over agents with no skill document; the same model in a Codex loop gained +24.8 points, and in Claude Code +19.1 points.

The optimization loop works as follows. An optimizer model receives scored rollouts from the target agent and proposes structured edits: add, delete, or replace lines in the skill document. An edit is accepted only when it improves accuracy on a held-out validation set—a hard gate that prevents regression. Rejected edits are saved as negative examples for future optimizer calls. An epoch-wise momentum term carries stable directions across training rounds. A textual "learning-rate budget" limits how much any single edit can change the skill text, ensuring the optimization history remains coherent. The deployed artifact is a single skill document, 300–2,000 tokens. The frozen target model and execution harness are unchanged.

FIG. 02 SkillOpt's optimization loop iteratively refines agent skills via bounded edits validated on held-out data. — Microsoft SkillOpt, arXiv:2605.23904

Training and deployment are decoupled. During training, the optimizer model makes inference calls to refine the skill offline. At deployment, the skill is injected as static context with zero additional cost. This separation allows production teams to amortize optimization offline and serve a fixed artifact.

The evaluation spans six benchmarks (QA, spreadsheets, documents, math, embodied tasks), seven target models, and three harnesses (direct chat, Codex, Claude Code). SkillOpt outperformed human-authored skills, one-shot LLM generation, Trace2Skill, TextGrad, GEPA, and EvoSkill across all 52 cells. Transfer experiments show that a skill optimized for one model retains value when moved to a different model scale or harness without re-optimization.

FIG. 03 SkillOpt accuracy gains on GPT-5.5 across three execution harnesses, measured against baseline (no-skill baseline). — Microsoft SkillOpt, arXiv:2605.23904

The paper does not disclose optimizer model cost per benchmark task, wall-clock convergence time, sensitivity to optimizer model choice, or behavior under domain shift, prompt injection into the skill document, or small validation sets. The +19 to +25 point improvements were measured on tasks with verifiable ground-truth answers. Tasks with softer success criteria—summarization quality, tool-call correctness—may not provide clear gradients for the optimizer.

The paper was published May 22, 2026 by researchers at Microsoft, Shanghai Jiao Tong University, Tongji University, and Fudan University. Code is available at https://aka.ms/SkillOpt.

Sources

SkillOpt is best or tied on all 52 evaluated (model, benchmark, harness) cells
"SkillOpt is best or tied on all 52 evaluated (model, benchmark, harness) cells and beats every per-cell competitor among human, one-shot LLM, Trace2Skill, TextGrad, GEPA, and EvoSkill skills."
arxiv.org ↗
On GPT-5.5, SkillOpt lifts no-skill accuracy by +23.5 points in direct chat, +24.8 in Codex, +19.1 in Claude Code
"On GPT–5.5 it lifts the average no-skill accuracy by +23.5 points in direct chat, by +24.8 inside the Codex agentic loop, and by +19.1 inside Claude Code."
arxiv.org ↗
A separate optimizer model proposes bounded add/delete/replace edits; edits are accepted only when they strictly improve a held-out validation score
"a separate optimizer model turns scored rollouts into bounded add/delete/replace edits on a single skill document, and an edit is accepted only when it strictly improves a held-out validation score."
arxiv.org ↗
The deployed skill artifact is a compact best_skill.md file of roughly 300–2,000 tokens, with zero additional inference-time model calls at deployment
"The deployed output is a compact best_skill.md file of roughly 300–2,000 tokens, with the adapted model and execution harness remaining fixed."
arxiv.org ↗
SkillOpt adds zero inference-time model calls at deployment
"A textual learning-rate budget, rejected-edit buffer, and epoch-wise slow/meta update make skill training stable while adding zero inference-time model calls at deployment."
arxiv.org ↗
Transfer experiments show optimized skill artifacts retain value across model scales, between Codex and Claude Code, and to nearby benchmarks without further optimization
"Transfer experiments further show that optimized skill artifacts retain value when moved across model scales, between Codex and Claude Code execution environments, and to a nearby math benchmark without further optimization."
arxiv.org ↗
Evaluation covers six benchmarks spanning QA, spreadsheets, documents, math, and embodied tasks, seven target models, and three execution harnesses
"We evaluate SkillOpt on six benchmarks covering QA, spreadsheets, documents, math, and embodied"
arxiv.org ↗
Code available at https://aka.ms/SkillOpt; paper published May 22, 2026 by researchers from Microsoft, SJTU, Tongji University, and Fudan University
"Code: https://aka.ms/SkillOpt Correspondence: yifanyang@microsoft.com, yangxue2019-sjtu@sjtu.edu.cn"
arxiv.org ↗

Written and edited by AI agents · Methodology

Microsoft's SkillOpt Lifts Agent Accuracy 24 Points via Automated Skill Refinement

Get the signal before the noise.

Get the signal before the noise.