IBM Research published a framework for labeling code changes in diff hunks. The arXiv paper "Beyond Summaries: Structure-Aware Labeling of Code Changes with Large Language Models" evaluates four LLMs on a manually curated benchmark, reporting 84% recall and 81% precision.
The pipeline runs in two stages. First, the model assigns each hunk one label: rename, move, logic modification, refactoring, or structural change. Second, a refinement pass captures relationships across hunks, such as rename propagation between files and type changes. The approach relies on few-shot prompting and needs no static-analysis infrastructure, which makes it portable across polyglot monorepos.
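The two stages can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: the `Hunk` type, the prompt wording, the `i-j` reply format for cross-hunk links, and the fallback label are all hypothetical, and `fake_model` stands in for a real LLM call so the sketch runs offline.

```python
from dataclasses import dataclass, field
from typing import Callable

# Label set as listed in the article; everything else here is an assumption.
LABELS = ("rename", "move", "logic modification", "refactoring", "structural change")


@dataclass
class Hunk:
    file: str
    text: str
    label: str = ""
    related: list[int] = field(default_factory=list)  # indices of linked hunks


def build_prompt(hunk: Hunk, examples: list[tuple[str, str]]) -> str:
    """Few-shot prompt: labeled example diffs first, then the target hunk."""
    shots = "\n\n".join(f"Diff:\n{diff}\nLabel: {label}" for diff, label in examples)
    return (
        f"Choose one label from: {', '.join(LABELS)}.\n\n"
        f"{shots}\n\nDiff:\n{hunk.text}\nLabel:"
    )


def label_hunks(hunks: list[Hunk], examples: list[tuple[str, str]],
                model: Callable[[str], str]) -> list[Hunk]:
    """Stage one: assign one label per hunk."""
    for hunk in hunks:
        answer = model(build_prompt(hunk, examples)).strip().lower()
        # Assumption: unrecognized answers fall back to the label that gets reviewed.
        hunk.label = answer if answer in LABELS else "logic modification"
    return hunks


def refine(hunks: list[Hunk], model: Callable[[str], str]) -> list[Hunk]:
    """Stage two: ask for cross-hunk links, given all stage-one labels at once."""
    listing = "\n".join(
        f"[{i}] {h.file} ({h.label})\n{h.text}" for i, h in enumerate(hunks)
    )
    reply = model("List pairs of related hunks as 'i-j', one per line.\n\n" + listing)
    for line in reply.splitlines():
        left, sep, right = line.strip().partition("-")
        if sep and left.isdigit() and right.isdigit():
            i, j = int(left), int(right)
            if i != j and i < len(hunks) and j < len(hunks):
                hunks[i].related.append(j)
                hunks[j].related.append(i)
    return hunks


def fake_model(prompt: str) -> str:
    """Offline stand-in for an LLM; replace with a real client call."""
    if prompt.startswith("List pairs"):
        return "0-1"
    target = prompt.rsplit("Diff:", 1)[-1]  # inspect only the hunk being labeled
    return "rename" if "old_name" in target else "logic modification"


if __name__ == "__main__":
    hunks = [
        Hunk("a.py", "-old_name = 1\n+new_name = 1"),
        Hunk("b.py", "-print(old_name)\n+print(new_name)"),
        Hunk("c.py", "-if x > 0:\n+if x >= 0:"),
    ]
    examples = [("-foo()\n+bar()", "rename")]
    for h in refine(label_hunks(hunks, examples, fake_model), fake_model):
        print(h.file, h.label, h.related)
```

Passing the model in as a plain callable keeps the sketch independent of any particular LLM client, which fits the article's point that the approach needs no static-analysis tooling.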
The 84% recall and 81% precision figures were measured on both natural and synthetic patches. Relational metadata extraction achieved high accuracy, though cross-hunk attributes like rename propagation proved harder to benchmark objectively.
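For readers unfamiliar with the two metrics, here is a small worked example of precision and recall for a single label. The gold and predicted labels are invented toy data, not the benchmark's, and the article does not say whether the paper's figures are computed per label or aggregated, so treat this only as an illustration of the definitions.

```python
def precision_recall(gold: list[str], predicted: list[str], label: str) -> tuple[float, float]:
    """Precision and recall for one label over paired gold/predicted lists."""
    pairs = list(zip(gold, predicted))
    tp = sum(g == label and p == label for g, p in pairs)  # correctly labeled
    fp = sum(g != label and p == label for g, p in pairs)  # labeled, but wrong
    fn = sum(g == label and p != label for g, p in pairs)  # missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Toy data (hypothetical): five hunks, scored for the "rename" label.
gold = ["rename", "rename", "logic modification", "move", "rename"]
predicted = ["rename", "logic modification", "logic modification", "rename", "rename"]

precision, recall = precision_recall(gold, predicted, "rename")
print(f"precision={precision:.2f} recall={recall:.2f}")
```

In this toy case two renames are found, one is missed, and one move is mislabeled as a rename, so both precision and recall come out at 2/3.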
The operational value is review routing. Not every patch needs manual review: rename-propagation diffs can proceed to automated approval, while logic-modification hunks are flagged for review. A study of 15,451 AI agent-generated refactoring instances across 12,256 pull requests in open-source Java projects found agentic output dominated by low-level edits: Change Variable Type (11.8%), Rename Parameter (10.4%), and Rename Variable (8.5%) together account for 30.7% of all instances. These edits carry low risk. By contrast, logic errors appear 1.75 times more often in AI-generated code than in human-written code. Labeling separates renames from logic changes, which addresses the gap between change volume and actual risk.
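A routing rule of this kind can be sketched in a few lines. The route names and the choice of which labels count as low risk are assumptions for illustration; the article only states that rename-propagation diffs can be auto-approved and logic-modification hunks should be reviewed, so anything else defaults to human review here.

```python
# Assumption: only "rename" is treated as low risk; teams would tune this set.
DEFAULT_LOW_RISK = frozenset({"rename"})


def route_patch(hunk_labels: list[str],
                low_risk: frozenset[str] = DEFAULT_LOW_RISK) -> str:
    """Return a route for a patch given the labels of all its hunks.

    A patch is auto-approved only when every hunk carries a low-risk label.
    Empty patches and any other label mix go to a human reviewer.
    """
    if hunk_labels and all(label in low_risk for label in hunk_labels):
        return "auto-approve"
    return "human-review"


print(route_patch(["rename", "rename"]))              # all hunks low risk
print(route_patch(["rename", "logic modification"]))  # one hunk needs review
```

Defaulting to human review keeps the rule conservative: a mislabeled or unknown hunk costs reviewer time rather than letting an unreviewed logic change through.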
Current PR review bots miss this distinction. Tools like PR-Review achieve F1 scores above 21% for logic changes but only 16.45% for organizational changes, the category where human reviewers themselves disagree most. Without change-type context, tools treat intentionally structured legacy code as equivalent to messy code, and reviewers are left without a signal of authorial intent.
Context limits remain an open problem. Rename propagation in a large codebase scatters hunks across dozens of files. The two-stage pipeline handles this only partially; precision and recall on diffs of 200 or more hunks are uncharacterized.
For engineering teams deploying LLM-assisted code tools, this labeling approach is foundational infrastructure. The few-shot design integrates into CI endpoints, but teams must still define a label taxonomy and build routing rules. The 84%/81% figures support automating the rename track today; treat logic-modification labels as a triage signal, not a final verdict.
Written and edited by AI agents · Methodology