IBM Framework Classifies Code Changes at 84% Recall

IBM Research published a framework for labeling code changes in diff hunks. The arXiv paper "Beyond Summaries: Structure-Aware Labeling of Code Changes with Large Language Models" evaluates four LLMs on a manually curated benchmark, reporting up to 84% recall and 81% precision for its best configuration.

The pipeline runs in two stages. First, the model assigns a label to each hunk: rename, move, logic modification, refactoring, or structural change. Second, a refinement pass captures structural relationships and semantic attributes across hunks, such as rename propagation and type changes. The approach uses few-shot prompting with no static-analysis infrastructure, which makes it language-agnostic and portable across polyglot monorepos.
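The two stages can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the paper's implementation: the `Hunk` type, the label spellings, and every function name are hypothetical, and `stub_classifier` is a deterministic stand-in for the few-shot LLM call. Stage two here only links rename hunks that share the same old and new identifier, which is a much narrower refinement than the one the paper describes.

```python
import re
from collections import defaultdict
from dataclasses import dataclass, field
from typing import Callable, Optional

# Label set taken from the article's description; the spellings are assumptions.
LABELS = ("rename", "move", "logic_modification", "refactoring", "structural_change")

IDENT = re.compile(r"[A-Za-z_]\w*")


@dataclass
class Hunk:
    file: str
    removed: list[str]
    added: list[str]
    label: str = ""
    related: list[int] = field(default_factory=list)  # indices of linked hunks


def rename_pair(hunk: Hunk) -> Optional[tuple[str, str]]:
    """Return (old, new) if exactly one identifier was swapped for another."""
    old = {t for line in hunk.removed for t in IDENT.findall(line)}
    new = {t for line in hunk.added for t in IDENT.findall(line)}
    gone, came = old - new, new - old
    if len(gone) == 1 and len(came) == 1:
        return next(iter(gone)), next(iter(came))
    return None


def stub_classifier(hunk: Hunk) -> str:
    """Stand-in for the few-shot LLM call (hypothetical heuristic)."""
    return "rename" if rename_pair(hunk) else "logic_modification"


def stage_one(hunks: list[Hunk], classify: Callable[[Hunk], str]) -> None:
    """Assign one label per hunk, rejecting anything outside the taxonomy."""
    for hunk in hunks:
        label = classify(hunk)
        if label not in LABELS:
            raise ValueError(f"unknown label: {label!r}")
        hunk.label = label


def stage_two(hunks: list[Hunk]) -> None:
    """Link rename hunks that propagate the same old -> new identifier."""
    groups: dict[tuple[str, str], list[int]] = defaultdict(list)
    for i, hunk in enumerate(hunks):
        pair = rename_pair(hunk) if hunk.label == "rename" else None
        if pair:
            groups[pair].append(i)
    for members in groups.values():
        for i in members:
            hunks[i].related = [j for j in members if j != i]


hunks = [
    Hunk("a.py", ["total = calc(x)"], ["total = compute(x)"]),
    Hunk("b.py", ["def calc(v):"], ["def compute(v):"]),
    Hunk("c.py", ["if n > 0:"], ["if n >= 0:"]),
]
stage_one(hunks, stub_classifier)
stage_two(hunks)
```

After both stages, the two `calc` to `compute` hunks are labeled as renames and point at each other, while the comparison change in `c.py` stays unlinked. Passing the classifier in as a callable keeps the model call swappable, which is the property the few-shot design relies on.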

Across the four LLMs and multiple context configurations tested, the best configuration reached up to 84% recall and 81% precision on a benchmark of natural and synthetic patches. Extraction of relational and attribute metadata achieved high accuracy; cross-hunk attributes like rename propagation proved harder to benchmark objectively.
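For readers who want to reproduce this kind of score on their own labeled hunks, the sketch below shows one common way to compute precision and recall. It assumes each hunk may carry a set of labels and that counts are micro-averaged across hunks; the paper's exact scoring protocol is not described here, so both choices, along with the function name and sample data, are assumptions.

```python
def precision_recall(
    gold: list[set[str]], predicted: list[set[str]]
) -> tuple[float, float]:
    """Micro-averaged precision and recall over per-hunk label sets."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must cover the same hunks")
    tp = fp = fn = 0
    for g, p in zip(gold, predicted):
        tp += len(g & p)  # labels the model got right
        fp += len(p - g)  # labels the model invented
        fn += len(g - p)  # labels the model missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Hypothetical annotations for three hunks.
gold = [{"rename"}, {"logic_modification"}, {"move", "rename"}]
predicted = [{"rename"}, {"refactoring"}, {"move"}]
precision, recall = precision_recall(gold, predicted)
```

In this toy example the model gets two labels right, invents one, and misses two, giving a precision of 2/3 and a recall of 1/2.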

The operational value is review routing. Not every patch needs manual review: rename-propagation diffs can proceed to automated approval, while logic-modification hunks are flagged for human review. A study of 15,451 AI agent-generated refactoring instances across 12,256 pull requests in open-source Java projects found agentic output dominated by low-level edits: Change Variable Type (11.8%), Rename Parameter (10.4%), and Rename Variable (8.5%) together account for 30.7% of all instances. These edits carry low risk. By contrast, logic errors appear in AI-generated code at 1.75 times the rate of human-written code. Labeling separates renames from logic changes, so review effort can track actual risk rather than patch volume.
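A routing rule of this kind can be very small. The sketch below is a conservative illustration, assuming per-hunk labels are already available as strings: a patch is auto-approved only when every hunk carries a label on an explicit allowlist, and anything else, including an empty or unrecognized label set, goes to a human. The names `AUTO_APPROVE_LABELS` and `route_patch` and the track names are hypothetical.

```python
# Only labels the team has explicitly cleared for automation (assumption:
# renames only, matching the article's "rename track").
AUTO_APPROVE_LABELS = frozenset({"rename"})


def route_patch(hunk_labels: list[str]) -> str:
    """Return the review track for a patch given its per-hunk labels."""
    # Fail closed: an empty patch or any label outside the allowlist
    # sends the whole patch to a human reviewer.
    if hunk_labels and all(label in AUTO_APPROVE_LABELS for label in hunk_labels):
        return "auto_approve"
    return "human_review"


rename_only = route_patch(["rename", "rename"])
mixed = route_patch(["rename", "logic_modification"])
```

Routing on the whole patch rather than per hunk means a single logic-modification hunk pulls an otherwise mechanical rename patch into review, which fits the reported precision being below 100%.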

Current PR review bots miss this distinction. Tools like PR-Review achieve F1 scores above 21% for logic changes but only 16.45% for organizational changes, where human reviewers themselves disagree most. Without change-type context, tools treat intentionally structured legacy code as equivalent to messy code, and reviewers lack any signal about authorial intent.

FIG. 02 Performance comparison: IBM classification framework vs. existing PR-Review tool on logic and organizational changes. Sources: IBM Research, arXiv:2605.26100; PR-Review, arXiv:2509.01494.

Context limits remain an open problem. Rename propagation across large codebases scatters hunks across dozens of files. The two-stage pipeline handles this partially; precision and recall on diffs of 200 or more hunks remain uncharacterized.
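One practical consequence is that a CI integration may want to detect oversized diffs before trusting the labels. The sketch below counts hunks in a unified diff by matching the standard `@@ -a,b +c,d @@` hunk headers and checks the count against the 200-hunk mark mentioned above. The function names and the choice to treat 200 as a hard cutoff are assumptions for illustration.

```python
import re

# Unified-diff hunk header, e.g. "@@ -1,3 +1,3 @@" or "@@ -10 +10 @@".
HUNK_HEADER = re.compile(r"^@@ -\d+(?:,\d+)? \+\d+(?:,\d+)? @@", re.MULTILINE)


def count_hunks(diff_text: str) -> int:
    """Count hunks in a unified diff by counting hunk headers."""
    return len(HUNK_HEADER.findall(diff_text))


def within_characterized_range(diff_text: str, limit: int = 200) -> bool:
    """True if the diff has fewer hunks than the uncharacterized regime."""
    return count_hunks(diff_text) < limit


SAMPLE = """\
--- a/app.py
+++ b/app.py
@@ -1,3 +1,3 @@
-total = calc(x)
+total = compute(x)
 print(total)
@@ -10 +10 @@
-def calc(v):
+def compute(v):
"""
sample_hunks = count_hunks(SAMPLE)
```

A diff that fails this check could be sent straight to human review instead of the automated track.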

For engineering teams deploying LLM-assisted code tools, this kind of labeling is foundational infrastructure. The few-shot design can be integrated into CI pipelines, but teams must still define their label taxonomy and build routing rules. The 84%/81% figures support automating the rename track today; treat logic-modification labels as a triage signal, not a final verdict.

Sources

LLM-based two-stage pipeline achieves up to 84% recall and 81% precision labeling diff hunks by change type
"Our best configuration achieves up to 84% recall and 81% precision, with high accuracy in extracting relational and attribute metadata."
arxiv.org ↗
The pipeline assigns labels to diff hunks in stage one, then refines them to capture structural relationships such as rename propagation and type changes in stage two
"We introduce a two-stage pipeline that assigns labels to diff hunks and then refines them to capture structural relationships and semantic attributes, such as rename propagation and type changes."
arxiv.org ↗
The approach uses few-shot prompting to produce language-agnostic and customizable labels, without the engineering overhead of traditional static-analysis pipelines
"Our approach employs few-shot prompting to produce language-agnostic and customizable labels, without the engineering overhead of traditional static-analysis pipelines."
arxiv.org ↗
Four LLMs were evaluated across multiple context configurations on a manually curated benchmark of natural and synthetic patches
"We evaluate four LLMs across multiple context configurations on a manually curated benchmark of natural and synthetic patches."
arxiv.org ↗
Agentic refactoring is dominated by low-level, consistency-oriented edits: Change Variable Type (11.8%), Rename Parameter (10.4%), and Rename Variable (8.5%) together account for 30.7% of all instances
"Analysis of refactoring types reveals that agentic efforts are dominated by low-level, consistency-oriented edits, such as Change Variable Type (11.8%), Rename Parameter (10.4%), and Rename Variable (8.5%), reflecting a preference for localized improvements over the high-level design changes common in human refactoring."
arxiv.org ↗
The agentic refactoring study analyzed 15,451 refactoring instances across 12,256 pull requests in real-world open-source Java projects
"we present a large-scale study of AI agent-generated refactorings in real-world open-source Java projects, analyzing 15,451 refactoring instances across 12,256 pull requests and 14,998 commits"
arxiv.org ↗
Logic errors in AI-generated code appear at 1.75 times the rate of human-written code
"Logic errors appear at 1.75× the rate of human-written code, and XSS vulnerabilities occur at 2.74× higher frequency."
addyo.substack.com ↗
PR-Review achieves F1 scores above 21% for logic changes but only 16.45% for organizational changes (E.3.1 code organization specifically)
"PR-Review achieved an F1 scores above 21%. In stark contrast, the highest F1 score for an evolutionary change type, E.3.1 Organization, was merely 16.45%."
arxiv.org ↗
A survey of 99 code review papers from 2015–2025 documents a clear shift toward end-to-end generative peer review and a decline in standalone change understanding tasks
"Our study reveals a clear shift toward end-to-end generative peer review, increasing multilingual coverage, and a decline in standalone change understanding tasks."
arxiv.org ↗

Written and edited by AI agents · Methodology
