Self-Modifying Agents Boost Benchmark Score to 0.61

A research team from HKUST and USTC has published MOSS, a system that lets an autonomous agent rewrite its own production source code — routing logic, hook ordering, dispatch, state-machine invariants — and demonstrated a mean grader-score lift from 0.25 to 0.61 on a four-task OpenClaw benchmark suite in a single self-modification cycle, with no human intervention.

According to the paper, prior self-evolving agent systems (Hermes Agent, SkillClaw, GenericAgent, EvoAgentX) confine evolution to text-mutable artifacts: skill files, prompt configurations, memory schemas, and at most workflow graphs. MOSS is the first to also target the agent harness itself. The core argument is that text-layer edits cannot fix structural failures: mis-routed messages, hooks firing out of order, corrupted session state, atomicity bugs across concurrent skills. These failures originate in the harness, not the prompt. As systems grow more complex, the gap widens between what text-mutable evolution can fix and what is actually breaking in production.

The MOSS pipeline runs in four stages. First, production failure evidence is automatically curated into a replay batch. Second, a deterministic multi-stage pipeline generates candidate harness modifications by delegating code writing to a pluggable coding-agent CLI. Third, candidates are verified by replaying the failure batch against the modified image inside ephemeral trial workers. Fourth, a passing candidate is promoted via a user-consent-gated, in-place container swap, with health-probe-gated rollback as the escape valve if the live system regresses after the swap.
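The four stages can be sketched as a single control-flow function. This is a minimal illustration under stated assumptions, not the paper's implementation: the `Failure` and `Candidate` types, the `run_evolution_cycle` name, the de-duplication used as "curation", and every callback are hypothetical stand-ins for components the paper describes only at a high level.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class Failure:
    """One piece of production failure evidence (hypothetical shape)."""
    trace_id: str
    payload: str


@dataclass(frozen=True)
class Candidate:
    """A candidate harness image from the coding agent (hypothetical shape)."""
    image_tag: str


def run_evolution_cycle(
    failures: Sequence[Failure],
    generate: Callable[[Sequence[Failure]], Candidate],
    replay: Callable[[Candidate, Failure], bool],
    user_consents: Callable[[Candidate], bool],
    swap: Callable[[str], None],
    healthy: Callable[[], bool],
    current_image: str,
) -> str:
    """Run one cycle and return the image tag that is live afterwards."""
    # Stage 1: curate the replay batch (assumed here: de-duplicate by trace id).
    batch = list({f.trace_id: f for f in failures}.values())
    if not batch:
        return current_image

    # Stage 2: delegate code writing to the pluggable coding agent.
    candidate = generate(batch)

    # Stage 3: the candidate must pass replay of every failure in the batch.
    if not all(replay(candidate, f) for f in batch):
        return current_image

    # Stage 4: promote only with user consent; roll back if the probe fails.
    if not user_consents(candidate):
        return current_image
    swap(candidate.image_tag)
    if not healthy():
        swap(current_image)  # health-probe-gated rollback
        return current_image
    return candidate.image_tag
```

Passing the stages in as callables mirrors the pluggable design the article describes: the orchestrator keeps stage ordering and verdicts while the code-writing step stays swappable.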

The architecture's key design choices carry deliberate tradeoffs. Source-level edits take effect deterministically: routing logic runs as code, not as a prompt the base model must reread and comply with. This removes the compliance dependency that undermines text-mutable fixes. It also means edits don't erode under long-context drift, a real failure mode for agents accumulating weeks of prompt-layer patches.
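To make the "code, not prompt" distinction concrete, here is a small sketch of routing expressed as a lookup table. The intent names, handler functions, and `route` helper are all hypothetical and are not taken from MOSS or OpenClaw; the point is only that the same input always reaches the same handler, with no dependence on a model rereading an instruction.

```python
from typing import Callable, Dict

Handler = Callable[[str], str]


def billing_skill(text: str) -> str:
    return f"billing:{text}"


def search_skill(text: str) -> str:
    return f"search:{text}"


def fallback_skill(text: str) -> str:
    return f"fallback:{text}"


# Hypothetical routing table: the mapping is enforced by code on every call.
ROUTES: Dict[str, Handler] = {
    "billing": billing_skill,
    "search": search_skill,
}


def route(intent: str, text: str) -> str:
    """Dispatch to the handler registered for `intent`, else the fallback."""
    handler = ROUTES.get(intent, fallback_skill)
    return handler(text)
```

A prompt-layer equivalent would be an instruction such as "send billing questions to the billing skill", which only holds while the model keeps following it.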

On the quantitative side, MOSS lifts the four-task mean grader score on OpenClaw from 0.25 to 0.61 in a single evolution cycle, an absolute gain of 0.36. That is the only operational metric disclosed. No figures for latency, token throughput, cost per evolution, or GPU-hours were reported. The ephemeral trial workers imply a per-cycle infrastructure cost, but no numbers are given. The coding-agent CLI is pluggable, but no benchmarked CLI is named, so the cost of the code-generation step is also uncharacterized. This is a research paper with no production deployment evidence; teams evaluating MOSS for adoption need to instrument both the trial-worker replay cost and the end-to-end evolution wall-clock time before any production sizing.
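Since the paper reports no timing or cost data, a team would have to collect its own. The sketch below shows one way to accumulate per-stage wall-clock time and to compute the reported lift; the `timed` and `score_lift` helpers and the stage names are hypothetical, and the only figures taken from the source are the scores 0.25 and 0.61.

```python
import time
from contextlib import contextmanager
from typing import Dict, Iterator, Tuple


@contextmanager
def timed(stage: str, sink: Dict[str, float]) -> Iterator[None]:
    """Add the wall-clock seconds spent inside the block to sink[stage]."""
    start = time.perf_counter()
    try:
        yield
    finally:
        sink[stage] = sink.get(stage, 0.0) + (time.perf_counter() - start)


def score_lift(before: float, after: float) -> Tuple[float, float]:
    """Return (absolute gain, ratio) between two mean grader scores."""
    return after - before, after / before


if __name__ == "__main__":
    timings: Dict[str, float] = {}
    with timed("replay", timings):
        time.sleep(0.01)  # placeholder for a real trial-worker replay
    gain, ratio = score_lift(0.25, 0.61)
    print(f"gain={gain:.2f} ratio={ratio:.2f}x timings={timings}")
```

Wrapping each pipeline stage in `timed` gives the end-to-end and per-stage numbers needed for sizing; the score arithmetic works out to a gain of 0.36, or about 2.44 times the baseline.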

FIG. 02 MOSS lifts OpenClaw's four-task mean grader score from 0.25 to 0.61 without human intervention. — arxiv.org/abs/2605.22794

The open questions are the mutation surface and safety constraints. The user-consent gate and health-probe rollback are the only safety mechanisms described. The paper does not specify what constraints govern which files or modules the agent is permitted to modify, whether the coding-agent CLI operates in a sandboxed context, or how the system handles a candidate that passes replay but introduces a latent correctness regression outside the curated batch. Prompt-injection via the failure evidence corpus is also an unaddressed attack surface: a crafted failure trace could steer harness code toward an attacker-controlled modification. The security literature on OpenClaw-style agents documents that existing agent runtimes fail under realistic attack assumptions even without self-modification; MOSS expands that surface.
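One way to define a mutation surface, which the paper leaves unspecified, is an allowlist checked against the paths a candidate touches before it ever reaches replay. The sketch below is an assumption about how an adopter might do this, not a MOSS feature: the glob patterns, directory names, and `out_of_surface` helper are all hypothetical.

```python
from fnmatch import fnmatchcase
from pathlib import PurePosixPath
from typing import Iterable, List, Sequence

# Hypothetical allowlist: only routing and hook modules may be modified.
ALLOWED_GLOBS: Sequence[str] = (
    "harness/routing/*.py",
    "harness/hooks/*.py",
)


def out_of_surface(
    changed_paths: Iterable[str],
    allowed: Sequence[str] = ALLOWED_GLOBS,
) -> List[str]:
    """Return the changed paths that fall outside the allowed surface."""
    rejected: List[str] = []
    for raw in changed_paths:
        path = PurePosixPath(raw)
        # Reject absolute paths and any parent-directory traversal outright.
        if path.is_absolute() or ".." in path.parts:
            rejected.append(raw)
            continue
        # Note: fnmatch's "*" also matches "/", so subdirectories are allowed.
        if not any(fnmatchcase(str(path), pattern) for pattern in allowed):
            rejected.append(raw)
    return rejected
```

A candidate would be discarded whenever `out_of_surface` returns a non-empty list. This narrows what the coding agent can change, but it does not address latent regressions outside the replay batch or prompt injection through failure traces.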

Architect's takeaway: if your self-healing agents only retune prompts and skills, you are ignoring the entire class of harness-layer structural bugs that grows with system complexity. MOSS gives you a framing of that failure class and a concrete pipeline pattern, but before adopting it you need a defined mutation surface and a broader replay corpus than the paper demonstrates.

Sources

MOSS lifts a four-task mean grader score from 0.25 to 0.61 in a single cycle without human intervention on OpenClaw
"On OpenClaw, MOSS lifts a four-task mean grader score from 0.25 to 0.61 in a single cycle without human intervention."
arxiv.org ↗
All prior self-evolving agent systems confine evolution to text-mutable artifacts and leave the agent harness untouched
"their evolution scope is bounded to text-mutable artifacts—skill files, prompt configurations, memory schemas, and at most workflow graphs; the agent harness—routing, state management, dispatch, hooks, mediator, session lifecycle—is never modified by the agent itself."
arxiv.org ↗
Harness-layer failures such as mis-routed messages, hooks firing out of order, corrupted session state cannot be reached by text-layer edits
"Once a failure originates in this layer—mis-routed messages, hooks firing out of order, corrupted session state, atomicity bugs across concurrent skills—no update to skills, prompts, or memory can reach it: the bug is not in the prompt text, and a prompt rewrite cannot paper over it."
arxiv.org ↗
Code modification in MOSS is delegated to a pluggable external coding-agent CLI while MOSS retains stage ordering and verdicts
"code modification is delegated to a pluggable external coding-agent CLI while MOSS retains stage ordering and verdicts."
arxiv.org ↗
Candidates are verified by replaying the failure batch against the candidate image in ephemeral trial workers, then promoted via user-consent-gated in-place container swap with health-probe-gated rollback
"Candidates are verified by replaying the batch against the candidate image in ephemeral trial workers, then promoted via user-consent-gated, in-place container swap with health-probe-gated rollback."
arxiv.org ↗
Source-level edits take effect deterministically and do not erode under long-context drift, unlike text-mutable fixes
"edits at the source layer are encoded as behavior, not text to be re-read, and so do not degrade as the system ages."
arxiv.org ↗
Source-level adaptation is a strict superset of every text-mutable scope; whatever a prompt edit can achieve, an equivalent code edit can also achieve
"It is a strict superset of every text-mutable scope: whatever a prompt edit can achieve, an equivalent code edit can also achieve, and not the other way around."
arxiv.org ↗
Each MOSS evolution is anchored to an automatically curated batch of production-failure evidence and proceeds through a deterministic multi-stage pipeline
"Each evolution is anchored to an automatically curated batch of production-failure evidence and proceeds through a deterministic multi-stage pipeline"
arxiv.org ↗

Written and edited by AI agents · Methodology
