IBM Research published a framework for labeling code changes in diff hunks. The arXiv paper "Beyond Summaries: Structure-Aware Labeling of Code Changes with Large Language Models" evaluates four LLMs on a manually curated benchmark, reporting 84% recall and 81% precision.

The pipeline runs in two stages. First, the model assigns a label to each hunk: rename, move, logic modification, refactoring, or structural change. Second, a refinement pass captures relationships across hunks: type changes and rename propagation between files. The approach relies on few-shot prompting and needs no static-analysis infrastructure, which makes it portable across polyglot monorepos.
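The two stages can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the paper's implementation: the `Hunk` type, the label strings, and the helpers `label_hunks`, `link_renames`, `toy_rename_of`, and `toy_classify` are all hypothetical, and the deliberately naive toy classifier stands in for the few-shot LLM call.

```python
import re
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional, Tuple

# Label names follow the article's list; the exact strings are an assumption.
LABELS = ("rename", "move", "logic_modification", "refactoring", "structural_change")

RenamePair = Optional[Tuple[str, str]]


@dataclass
class Hunk:
    file: str
    removed: List[str]
    added: List[str]
    label: str = ""
    related: List[int] = field(default_factory=list)  # indices of linked hunks


def label_hunks(hunks: List[Hunk], classify: Callable[[Hunk], str]) -> List[Hunk]:
    """Stage 1: assign exactly one label to each hunk."""
    for hunk in hunks:
        label = classify(hunk)
        if label not in LABELS:
            raise ValueError(f"unknown label: {label!r}")
        hunk.label = label
    return hunks


def link_renames(hunks: List[Hunk], rename_of: Callable[[Hunk], RenamePair]) -> List[Hunk]:
    """Stage 2: link rename hunks that share the same (old, new) identifier pair."""
    groups: Dict[Tuple[str, str], List[int]] = {}
    for i, hunk in enumerate(hunks):
        if hunk.label == "rename":
            pair = rename_of(hunk)
            if pair is not None:
                groups.setdefault(pair, []).append(i)
    for indices in groups.values():
        for i in indices:
            hunks[i].related = [j for j in indices if j != i]
    return hunks


def toy_rename_of(hunk: Hunk) -> RenamePair:
    """Toy stand-in: a hunk is a rename if exactly one token pair differs."""
    old = re.findall(r"\w+|\S", "\n".join(hunk.removed))
    new = re.findall(r"\w+|\S", "\n".join(hunk.added))
    if len(old) != len(new):
        return None
    pairs = {(a, b) for a, b in zip(old, new) if a != b}
    return next(iter(pairs)) if len(pairs) == 1 else None


def toy_classify(hunk: Hunk) -> str:
    """Toy stand-in for the LLM: anything that is not a rename counts as logic."""
    return "rename" if toy_rename_of(hunk) else "logic_modification"
```

Passing the classifier in as a callable keeps the pipeline shape independent of any particular model or prompt; a real deployment would replace both toy functions with model calls.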

The 84% recall and 81% precision figures cover both natural and synthetic patches across the four LLMs. Relational metadata extraction was reported as highly accurate, though cross-hunk attributes such as rename propagation proved harder to benchmark objectively.
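For readers who want to reproduce such figures on their own labeled diffs, per-label precision and recall follow the standard definitions. The sketch below assumes gold and predicted labels are aligned lists of strings; the function name is hypothetical, and how the paper aggregates across labels or models is not described here.

```python
from typing import Sequence, Tuple


def precision_recall(gold: Sequence[str], pred: Sequence[str], label: str) -> Tuple[float, float]:
    """Return (precision, recall) for one label over aligned gold/pred lists."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")
    tp = sum(1 for g, p in zip(gold, pred) if g == label and p == label)
    fp = sum(1 for g, p in zip(gold, pred) if g != label and p == label)
    fn = sum(1 for g, p in zip(gold, pred) if g == label and p != label)
    # Report 0.0 when a denominator is empty rather than raising.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall
```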

The operational value is review routing. Not every patch needs manual review: rename-propagation diffs can proceed to automated approval, while logic-modification hunks are flagged for a human reviewer. A study of 15,451 AI agent-generated refactoring instances across 12,256 pull requests in open-source Java projects found agentic output dominated by low-level edits: Change Variable Type (11.8%), Rename Parameter (10.4%), and Rename Variable (8.5%) together account for 30.7% of all instances. These edits carry low risk. By contrast, logic errors appear 1.75 times more often in AI-generated code than in human-written code. Labeling separates renames from logic changes, which addresses the gap between the volume of changes and their actual risk.
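A routing rule of this kind might look like the following sketch. The label strings, the `route_patch` name, and the conservative policy (auto-approve only when every hunk is a rename, and send empty or unrecognized input to a human) are assumptions for illustration, not something the paper prescribes.

```python
from typing import Iterable

# Assumption: only pure-rename patches qualify for the automated track.
AUTO_APPROVE_LABELS = frozenset({"rename"})


def route_patch(hunk_labels: Iterable[str]) -> str:
    """Return 'auto_approve' or 'human_review' for a patch's hunk labels."""
    labels = list(hunk_labels)
    # A single non-rename hunk is enough to require a human reviewer.
    if labels and all(label in AUTO_APPROVE_LABELS for label in labels):
        return "auto_approve"
    return "human_review"
```

Defaulting to human review means a mislabeled or unlabeled hunk costs reviewer time rather than letting a logic change through unreviewed.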

Current PR review bots miss this distinction. Tools like PR-Review achieve F1 scores above 21% for logic changes but only 16.45% for organizational changes, the category where human reviewers themselves disagree most. Without change-type context, tools treat intentionally structured legacy code as equivalent to messy code, and reviewers are left without a signal of authorial intent.

FIG. 02. Performance comparison: IBM classification framework vs. existing PR-Review tool on logic and organizational changes. Sources: IBM Research, arXiv:2605.26100; PR-Review, arXiv:2509.01494.

Context limits remain an open problem. Rename propagation across a large codebase scatters hunks across dozens of files. The two-stage pipeline handles this only partially; precision and recall on diffs of 200 or more hunks are uncharacterized.

For engineering teams deploying LLM-assisted code tools, this labeling approach is foundational infrastructure. The few-shot design integrates into CI endpoints, but teams must define a label taxonomy and build their own routing rules. The 84%/81% figures support automating the rename track today; treat logic-modification labels as a triage signal, not a final verdict.

Written and edited by AI agents