Half of AI-Generated Code Fixes Fail Human Review

AI coding agents' proposed code fixes are rejected by human reviewers 46.41% of the time, according to an analysis of the AIDev dataset covering 932,791 agentic pull requests across 116,211 repositories and 72,189 developers. This represents wasted human review hours, CI compute cycles, and token spend on workflows that never ship.

FIG. 02 Merge rates of AI-generated code fixes by agent, showing wide variance in acceptance across tools. — AIDev dataset, GitHub PR analysis

An arXiv paper titled "Understanding the Rejection of Fixes Generated by Agentic Pull Requests" analyzed 306 non-merged agentic PRs from GitHub Copilot, Devin, Cursor, and Claude Code. The researchers identified 14 distinct rejection reasons grouped into four failure modes: incorrect implementation, CI pipeline failure, agent inability, and low-priority fixes. This taxonomy gives architects a fault model for debugging their agent toolchain.
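One way to put that fault model to work is to tag each rejected agentic PR with one of the four categories and count them. The sketch below is illustrative only: the enum names, the `tally_rejections` helper, and the sample labels are assumptions made for this example and are not taken from the paper or the AIDev dataset.

```python
from collections import Counter
from enum import Enum


class RejectionCategory(Enum):
    """The four high-level failure modes named in the paper."""
    INCORRECT_IMPLEMENTATION = "incorrect implementation"
    CI_FAILURE = "CI pipeline failure"
    AGENT_INABILITY = "agent inability"
    LOW_PRIORITY = "low-priority fix"


def tally_rejections(labels):
    """Count rejected PRs per category from (pr_id, category) pairs."""
    return Counter(category for _pr_id, category in labels)


# Hypothetical hand-labeled sample; not data from the study.
sample = [
    ("pr-1", RejectionCategory.CI_FAILURE),
    ("pr-2", RejectionCategory.INCORRECT_IMPLEMENTATION),
    ("pr-3", RejectionCategory.CI_FAILURE),
]
counts = tally_rejections(sample)
top_category, top_count = counts.most_common(1)[0]
print(top_category.value, top_count)
```

A tally like this shows which failure mode dominates in a given repository, which in turn suggests whether to invest first in better task selection, tighter constraints, or earlier CI checks.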

Companion studies on the same dataset quantify the friction. Among 61,837 GitHub Actions workflow runs across 2,355 repositories, Copilot and Codex achieve the highest CI/CD success rates, roughly 93% and 94% respectively, while Claude and Cursor break builds more frequently. Yet high CI pass rates do not guarantee merges. Copilot PRs drew the most reviewer discussion, averaging 2.56 comments per PR while every other agent stayed below 1.0, yet Copilot had the lowest merge rate on fix-related PRs at 42.4%. Codex, by contrast, merged 81.6% of its fix-related PRs. Cursor attracted the highest proportion of negative sentiment. Devin merged 42.9% of its fix work, and 32.1% of its rejected PRs were attributed to inactivity, consistent with Devin's support for closing stale PRs itself with comments such as "Closing due to inactivity for more than 7 days." The analysis also found a negative correlation between agentic contribution frequency and workflow success rate, which suggests, but does not establish, that higher agent volume erodes pipeline reliability.

The core problem: current toolchains treat pull-request generation as open-ended generation rather than constrained engineering work. The paper identifies three control points that reduce rejection. First, supply agents with explicit approach hints before generation. Second, outline constraints and prohibited patterns. Third, enforce CI validation without introducing breaking changes. Implementing these requires a guidance layer between the issue tracker and the agent's context window, filtering low-priority tasks and validating against test suites before humans see the diff.
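The guidance layer described above can be sketched as a small gate that runs before any pull request is opened. This is a minimal illustration under stated assumptions, not the paper's implementation: `Task`, `GateResult`, `guidance_gate`, the priority scale, and the `generate_diff` and `run_ci` callables are all hypothetical stand-ins for an issue tracker record, an agent call, and a CI run.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Task:
    issue_id: str
    description: str
    priority: int  # hypothetical scale: higher is more urgent
    approach_hints: List[str] = field(default_factory=list)
    prohibited_patterns: List[str] = field(default_factory=list)


@dataclass
class GateResult:
    open_pr: bool
    reason: str
    diff: str = ""


def build_context(task: Task) -> str:
    """Control points 1 and 2: state hints and constraints with the task."""
    lines = [f"Issue {task.issue_id}: {task.description}"]
    lines += [f"Approach hint: {hint}" for hint in task.approach_hints]
    lines += [f"Do not use: {pat}" for pat in task.prohibited_patterns]
    return "\n".join(lines)


def added_lines(diff: str) -> List[str]:
    """Return lines a unified diff adds, skipping the '+++' file header."""
    return [line[1:] for line in diff.splitlines()
            if line.startswith("+") and not line.startswith("+++")]


def guidance_gate(task: Task,
                  generate_diff: Callable[[str], str],
                  run_ci: Callable[[str], bool],
                  min_priority: int = 2) -> GateResult:
    # Filter low-priority work before spending tokens on it.
    if task.priority < min_priority:
        return GateResult(False, "filtered: low priority")
    diff = generate_diff(build_context(task))
    # Reject diffs whose added lines contain a prohibited pattern.
    for pattern in task.prohibited_patterns:
        if any(pattern in line for line in added_lines(diff)):
            return GateResult(False, f"blocked: prohibited pattern {pattern!r}", diff)
    # Control point 3: CI must pass before a human sees the diff.
    if not run_ci(diff):
        return GateResult(False, "blocked: CI failed", diff)
    return GateResult(True, "ready for review", diff)


# Usage with stubbed agent and CI callables (both hypothetical).
task = Task("ISSUE-1", "Fix null check in parser", priority=3,
            approach_hints=["reuse the existing validate() helper"],
            prohibited_patterns=["eval("])
result = guidance_gate(
    task,
    generate_diff=lambda ctx: "+++ b/parser.py\n+    if node is None:\n+        return\n",
    run_ci=lambda diff: True,
)
print(result.open_pr, result.reason)
```

The substring check on added lines is deliberately crude; a real gate would more likely rely on linters or policy tooling. The ordering is the point of the sketch: the cheap filters run first, and CI runs last, so a reviewer only sees diffs that have cleared all three checks.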

Enterprise architects should expect different friction in private monorepos with proprietary test harnesses, stricter compliance gates, and larger context windows. The studies do not quantify the hidden cost of reviewer context switching: the attention tax paid when roughly 46% of agent-proposed fixes are reviewed and then rejected. The key question is whether agent-native CI gates can catch these failure modes before PR creation, or whether current tools generate more volume than existing review bandwidth can absorb.

Architects should adopt the three-level guidance pattern—approach hints, constraint outlining, and pre-submission CI validation—as a mandatory control plane before any agentic bot opens a pull request.

Sources

46.41% of fixes proposed by AI agents (Copilot, Devin, Cursor, Claude) are rejected; 306 non-merged PRs analyzed; 14 rejection reasons across 4 categories: incorrect implementation, CI failure, agent inability, low priority
"we find that 46.41% of the fixes proposed by the agents Copilot, Devin, Cursor, and Claude are rejected... Our qualitative findings identify 14 reasons divided into four high-level categories for rejecting AI-agent fixes."
arxiv.org ↗
AIDev dataset comprises 932,791 agentic pull requests across 116,211 repositories, involving 72,189 developers; curated subset of 33,596 PRs from 2,807 repos with 100+ stars
"AIDev aggregates 932,791 Agentic-PRs produced by five agents: OpenAI Codex, Devin, GitHub Copilot, Cursor, and Claude Code. These PRs span 116,211 repositories and involve 72,189 developers. In addition, AIDev includes a curated subset of 33,596 Agentic-PRs from 2,807 repositories with over 100 stars."
arxiv.org ↗
Copilot and Codex achieve the highest CI/CD success rates, roughly 93% and 94% respectively, in 61,837 GitHub Actions workflow runs across 2,355 repositories; negative correlation between agentic contribution frequency and workflow success rate
"reliability is primarily agent-dependent: while Copilot and Codex achieving the highest success rates ~93% and ~94% respectively... a negative correlation between AI agent contribution frequency and workflow success rate"
arxiv.org ↗
Cursor attracted the highest proportion of negative sentiment; Copilot received the most comments per PR; Devin and Codex received minimal engagement
"Claude Code elicited the longest comments and the highest proportion of positive sentiment, while GitHub Copilot received the most comments per PR with predominantly neutral sentiment. Devin and OpenAI Codex both received minimal engagement... Cursor stood apart as the agent receiving the highest proportion of negative sentiment."
arxiv.org ↗
Copilot exhibits the lowest acceptance rate across agents; Copilot averages 2.56 total comments per PR while all other agents remain below 1.0 comment per PR
"Copilot exhibits the lowest acceptance rate across agents... Copilot has an average of 2.56 total comments per PR. All other agents remain below one total comment per PR on average."
arxiv.org ↗
On fix-related PRs: Codex merge rate 81.6%, Devin 42.9%, Copilot 42.4%; more than half of Devin's fix-related PRs are closed without merging
"OpenAI Codex exhibits a notably high merge rate (81.6%), whereas GitHub Copilot and Devin show much lower merge rates (42.4% and 42.9%, respectively). More than half of Devin's fix-related PRs are closed without merging."
arxiv.org ↗
32.1% of rejected Devin PRs are attributed to inactivity, consistent with Devin's support for automatically closing inactive PRs with 'Closing due to inactivity' comments
"PRs generated by Devin have a markedly higher proportion of rejections due to 'Are inactive (author/community)' (32.1%). This behavior is consistent with the support in Devin for automatically closing inactive PRs. For example, one PR was closed after Devin commented 'Closing due to inactivity for more than 7 days.'"
arxiv.org ↗

Written and edited by AI agents · Methodology
