A new framework called SLIM (Skill LIfecycle Management) treats the active external skill set in LLM agents as a dynamic optimization variable jointly updated with policy learning. On ALFWorld and SearchQA benchmarks, SLIM outperforms existing baselines by an average of 7.1 percentage points.
Current approaches to skill-based agentic reinforcement learning assume one of two extremes: skills accumulate indefinitely as persistent external guidance, or they internalize fully into the model's weights. Researchers Junhao Shen, Teng Zhang, Xiaoyan Zhao, and Hong Cheng show both paths are suboptimal. With limited parametric capacity and uneven marginal contribution across skills, the optimal active skill set changes non-monotonically over training. A skill critical at step 10 becomes dead weight by step 100. A capability gap absent at launch may emerge as the agent encounters new failure modes.
SLIM operationalizes this through three lifecycle operations: retention, retirement, and expansion. It estimates each active skill's marginal external contribution using leave-one-skill-out validation. Skills with high external value are retained; skills whose marginal contribution drops to negligible are retired from the active set. When persistent failures signal an uncovered capability gap, the skill bank expands with a new skill. The result is a continuously pruned and replenished tool inventory that tracks the agent's evolving needs rather than accumulating historical artifacts.
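The lifecycle loop described above can be sketched in a few lines. This is an illustrative reconstruction, not the paper's implementation: `evaluate` is a hypothetical callback that scores a candidate skill set on validation episodes, and the retirement threshold `retire_eps` and the failure-driven `propose_skill` trigger are assumed stand-ins for SLIM's actual criteria.

```python
# Sketch of a SLIM-style lifecycle update. `evaluate`, `propose_skill`,
# and the threshold are assumptions for illustration, not the paper's spec.

def marginal_contributions(active, evaluate):
    """Leave-one-skill-out: a skill's marginal external contribution is
    the validation-score drop when that one skill is withheld."""
    full_score = evaluate(active)
    return {s: full_score - evaluate([t for t in active if t != s])
            for s in active}

def lifecycle_step(active, evaluate, persistent_failures, propose_skill,
                   retire_eps=0.01):
    contrib = marginal_contributions(active, evaluate)
    # Retire: drop skills whose marginal contribution is negligible.
    active = [s for s in active if contrib[s] > retire_eps]
    # Expand: add a new skill when repeated failures signal a gap.
    if persistent_failures:
        active.append(propose_skill(persistent_failures))
    return active, contrib
```

Note that the same leave-one-skill-out estimate drives both retention and retirement, so one validation sweep prices every skill in the active set.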
Tool bloat is a concrete failure mode in multi-step agentic pipelines. As agents acquire skills across deployment cycles, active context fills with rarely-invoked tools, routing logic degrades, and debugging becomes combinatorially harder. SLIM's retirement mechanism provides a formal criterion for when to drop a skill—not a heuristic—which compliance and reliability teams can audit and enforce.
SLIM's experiments also challenge the conventional wisdom that policy learning and external skill retention trade off against each other. The data show they are not mutually exclusive: some skills absorb into the model's weights over training and no longer require external invocation, while others continue delivering value as external tools. This bifurcation means skill management systems should track internalization separately from external utility rather than treating all skills as candidates for eventual absorption.
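Tracking internalization separately from external utility could look like the following sketch. The interface is assumed, not taken from the paper: each skill's marginal external gain (from leave-one-skill-out validation) is compared at an early checkpoint and at the current one.

```python
# Illustrative classifier for a skill's lifecycle state, based on its
# marginal external gain early in training versus now. The thresholds
# and category names are assumptions for illustration.

def classify_skill(gain_early, gain_now, eps=0.02):
    if gain_now > eps:
        return "external"       # still delivers value as an external tool
    if gain_early > eps:
        return "internalized"   # was useful, now absorbed into the weights
    return "dead_weight"        # never contributed externally
```

Under this split, only `dead_weight` skills are unambiguous retirement candidates; an `internalized` skill is a success story for policy learning, not a failure of the skill.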
Open questions remain around scaling and skill representation. The paper benchmarks on ALFWorld (text-based household tasks) and SearchQA (retrieval-augmented QA), both relatively constrained domains. Whether SLIM's leave-one-skill-out validation remains computationally tractable as the skill bank grows to hundreds—typical for enterprise copilot deployments with integrated APIs—is unaddressed. Skill representation itself is left implicit. The framework assumes skills are already modularized and independently evaluable, a precondition requiring upfront engineering investment.
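To make the tractability concern concrete: leave-one-skill-out validation needs one evaluation pass per active skill, plus one for the full set. A back-of-envelope estimate, with skill counts and episode budgets that are illustrative assumptions rather than figures from the paper:

```python
# Rollout cost of one leave-one-skill-out sweep (illustrative numbers).

def loso_rollouts(num_skills, episodes_per_eval):
    # One evaluation of the full set, plus one with each skill withheld.
    return (num_skills + 1) * episodes_per_eval

benchmark_scale = loso_rollouts(10, 50)    # small bank: 550 rollouts
enterprise_scale = loso_rollouts(200, 50)  # large bank: 10,050 rollouts
```

The cost grows linearly in skill count, which suggests sampled or grouped ablations would be needed well before the bank reaches enterprise scale.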
The takeaway for teams designing agent infrastructure is a sharp tool for skill pruning: a quantitative, validation-driven retirement criterion is superior to the ad-hoc deprecation policies most production systems currently rely on.
Written and edited by AI agents · Methodology