Most production AI agents fail not because foundation models lack coding ability, but because the runtime infrastructure to verify and attribute model output does not exist. This is the core insight from AI Harness Engineering, a framework developed by researchers Hailin Zhong and Shengxin Zhu.
The framework posits that the model-harness-environment system is the unit of analysis, not the model alone. The harness is the runtime substrate that controls task observation, context selection, tool calls, feedback loops, state tracking, and completion detection. Without a well-specified harness, an agent operates open-loop against a live codebase, which is exactly where agents visibly break.
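To make the closed-loop framing concrete, here is a minimal sketch of an agent step with the harness in control. Every name (run_episode, EpisodeState, execute_tool) is illustrative rather than an API from the paper; the point is that observation, context selection, tool execution, feedback, state tracking, and completion detection live outside the model.

```python
# Minimal sketch of the closed loop the harness owns. All names are hypothetical.
from dataclasses import dataclass, field

@dataclass
class EpisodeState:
    step: int = 0
    history: list = field(default_factory=list)  # running record of actions and feedback
    done: bool = False

def run_episode(model, execute_tool, task, max_steps=20):
    state = EpisodeState()
    while not state.done and state.step < max_steps:
        # Task observation + context selection: the harness decides what the model sees.
        context = {"task": task, "recent": state.history[-5:]}
        action = model(context)                             # model proposes the next action
        feedback = execute_tool(action)                     # tool call, mediated by the harness
        state.history.append((action, feedback))            # state tracking / feedback loop
        state.done = feedback.get("task_complete", False)   # completion detection
        state.step += 1
    return state

# Trivial stand-ins for the model and tool layer, just to show the loop closing:
if __name__ == "__main__":
    model = lambda ctx: {"tool": "run_tests", "args": {}}
    execute_tool = lambda action: {"output": "3 passed", "task_complete": True}
    print(run_episode(model, execute_tool, task="fix failing test"))
```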
The framework names eleven component responsibilities: task specification, context selection, tool access, project memory, task state, observability, failure attribution, verification, permissions, entropy auditing, and intervention recording. The authors argue that the absence of these components — not model capability — explains why benchmark performance fails to transfer to production repositories.
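Treated as a checklist, the eleven responsibilities are easy to audit. The helper below is a hypothetical starting point, not tooling from the paper:

```python
# The eleven component responsibilities as an enumerable checklist, plus a simple
# coverage audit. Illustrative only; the paper ships no reference code.
from enum import Enum, auto

class HarnessComponent(Enum):
    TASK_SPECIFICATION = auto()
    CONTEXT_SELECTION = auto()
    TOOL_ACCESS = auto()
    PROJECT_MEMORY = auto()
    TASK_STATE = auto()
    OBSERVABILITY = auto()
    FAILURE_ATTRIBUTION = auto()
    VERIFICATION = auto()
    PERMISSIONS = auto()
    ENTROPY_AUDITING = auto()
    INTERVENTION_RECORDING = auto()

def missing_components(implemented: set[HarnessComponent]) -> set[HarnessComponent]:
    """Return the responsibilities a given pipeline has not yet covered."""
    return set(HarnessComponent) - implemented

# Example: a pipeline that only wires up a prompt template and a tool layer.
covered = {HarnessComponent.TASK_SPECIFICATION, HarnessComponent.TOOL_ACCESS}
print(sorted(c.name for c in missing_components(covered)))
```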
The framework defines a four-level capability ladder: H0 through H3. H0 is the baseline with task in, patch out, and no runtime support. The intermediate levels progressively expose more runtime support to the agent. H3 is full coverage with reproduction logs, deterministic requirement checks, and structured verification reports. The ladder serves as both a measurement scale and a deployment roadmap: instrument an existing pipeline, score its harness level, and identify which components to add next.
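A scoring function might look like the sketch below. Only the H0 and H3 criteria come from the ladder as described above; the H1 and H2 boundaries here are placeholder assumptions, since the paper's exact cutoffs are not reproduced in this summary.

```python
# Sketch of scoring a pipeline on the H0-H3 ladder. H0 and H3 follow the
# description above; the H1/H2 boundaries below are placeholder assumptions.
def harness_level(evidence: dict) -> str:
    has_repro = evidence.get("reproduction_logs", False)
    has_checks = evidence.get("deterministic_requirement_checks", False)
    has_reports = evidence.get("verification_reports", False)

    if has_repro and has_checks and has_reports:
        return "H3"  # full coverage: logs, deterministic checks, structured reports
    if has_repro or has_checks:
        return "H2"  # partial runtime support (assumed boundary)
    if evidence.get("any_runtime_support", False):
        return "H1"  # some runtime support exposed to the agent (assumed boundary)
    return "H0"      # task in, patch out, no runtime support

print(harness_level({"reproduction_logs": True}))  # -> "H2" under these assumptions
```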
Evaluation is trace-based. Each agent run becomes an episode package, an auditable artifact containing the patch and its evidence chain. At H0, the episode package contains only the final diff. At H3, it includes reproduction logs, failure attributions per test, and machine-readable verification reports. The evidence structure varies systematically with harness level, serving as a proxy for reliability and auditability.
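As a data structure, an episode package could be as simple as the following sketch; the field names are illustrative, and the paper's actual schema may differ.

```python
# Sketch of an episode package: the patch is always present, while the evidence
# chain fills in as the harness level rises. Field names are hypothetical.
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class EpisodePackage:
    patch: str                                                 # final diff, present at every level
    reproduction_logs: Optional[str] = None                    # H3: how the failure was reproduced
    failure_attributions: dict = field(default_factory=dict)   # H3: per-test failure attribution
    verification_report: Optional[dict] = None                 # H3: machine-readable check results

# An H0 episode carries only the patch; an H3 episode carries the full evidence chain.
h0 = EpisodePackage(patch="diff --git a/foo.py b/foo.py ...")
h3 = EpisodePackage(
    patch="diff --git a/foo.py b/foo.py ...",
    reproduction_logs="$ pytest tests/test_foo.py -x ...",
    failure_attributions={"tests/test_foo.py::test_bar": "regression in parse()"},
    verification_report={"requirements_checked": 4, "passed": 4},
)
```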
This is a conceptual framework paper. Its controlled validation demonstrates structural differences across harness levels but reports no SWE-bench scores, inference costs, or deployment scale. Teams adopting the framework must operationalize all eleven components on their own stack and validate on their own tasks.
The paper outlines the framework and research program but does not ship reference code. Mapping the eleven responsibilities onto an existing agent stack (LangGraph, AutoGen, OpenHands, or a custom tool-call loop) is left to the practitioner. Entropy auditing and intervention recording lack established tooling and will have to be built in-house. The H0–H3 ladder is diagnostic, but the H2-to-H3 transition requires a deterministic requirement checker, which presumes the task is machine-verifiable; that holds only for codebases with unit-test coverage.
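For codebases that do have unit-test coverage, a deterministic requirement checker can be as simple as running the suite and emitting a machine-readable report. The sketch below assumes pytest is available on the path; the report schema is illustrative, not the paper's.

```python
# One possible shape for the deterministic requirement checker an H2-to-H3 move
# requires, assuming requirements are expressed as a unit-test suite.
import json
import subprocess

def verify_requirements(test_paths: list[str]) -> dict:
    """Run the given test paths and emit a machine-readable verification report."""
    results = {}
    for path in test_paths:
        # Assumes pytest is installed; each path is checked in its own invocation.
        proc = subprocess.run(["pytest", path, "-q"], capture_output=True, text=True)
        results[path] = {
            "passed": proc.returncode == 0,
            "exit_code": proc.returncode,
            "tail": proc.stdout[-500:],  # keep the last lines of output as evidence
        }
    return {
        "checks": results,
        "all_passed": all(r["passed"] for r in results.values()),
    }

if __name__ == "__main__":
    print(json.dumps(verify_requirements(["tests/"]), indent=2))
```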
Stop attributing agent failures to model quality before you have scored your pipeline on the harness ladder. H0 is where most production agent setups live, and the failure modes at H0 are infrastructure problems, not capability problems.