Strict Regex Fix Raises Agent Grading Recall by 60 Percentage Points

Automated graders for agentic systems fail where single-turn LLM evaluators do not. New research from Columbia Statistics pins down where the signal leaks. A paper titled "Grading the Grader" (Zheng and Hsu, June 2026) ran LAMBDA, an open-source dual-agent data-analysis system, against 153 numerical QRData tasks in the DSGym benchmark. The authors stress-tested three grading strategies against human labels and exposed two failure modes: answer extraction that misses correct answers, and grading runs that fail to complete.

LAMBDA pairs a "programmer" agent that writes Python from natural-language instructions with an "inspector" agent that catches execution errors and suggests corrections. The loop runs until the code succeeds or a retry ceiling is hit. This iterative, code-emitting design breaks standard graders: the agent's final answer can appear in a printed variable, a formatted string, a diagnostic log, or a partial trace—any of which a naive regex will miss.
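The programmer and inspector loop described above can be sketched in a few lines. This is a minimal illustration, not LAMBDA's implementation: the function name, the callable signatures, and the toy agents are all assumptions made for the example, and a real system would run generated code in a sandbox.

```python
import contextlib
import io
import traceback
from typing import Callable, Optional, Tuple

# Hypothetical signatures; LAMBDA's real interfaces are not described in this article.
Programmer = Callable[[str, Optional[str]], str]  # (instruction, feedback) -> code
Inspector = Callable[[str, str], str]             # (code, error text) -> suggestion


def run_self_correcting_loop(
    instruction: str,
    programmer: Programmer,
    inspector: Inspector,
    max_retries: int = 3,
) -> Tuple[bool, str]:
    """Return (succeeded, captured stdout on success or last error on failure)."""
    feedback: Optional[str] = None
    last_error = ""
    for _ in range(max_retries + 1):
        code = programmer(instruction, feedback)
        buffer = io.StringIO()
        try:
            with contextlib.redirect_stdout(buffer):
                exec(code, {})  # illustration only; sandbox this in practice
            return True, buffer.getvalue()
        except Exception:
            last_error = traceback.format_exc()
            feedback = inspector(code, last_error)
    return False, last_error


# Toy stand-ins for the two agents, used only to exercise the loop.
def toy_programmer(instruction: str, feedback: Optional[str]) -> str:
    if feedback is None:
        return "print(total / count)"  # fails on the first try: names are undefined
    return "total, count = 10, 4\nprint('mean =', total / count)"


def toy_inspector(code: str, error: str) -> str:
    return "Define total and count before dividing."


ok, output = run_self_correcting_loop("Compute the mean.", toy_programmer, toy_inspector)
```

The captured output here is free-form text, which is the property that makes downstream answer extraction hard.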

The paper tested three layers. First: strict regex matching, a non-GenAI approach that extracts numbers by pattern. Second: LLM-based lenient grading, which interprets the answer in context. Third: snippet-based human inspection. On precision, both automated graders achieved zero false positives across 70 human-validated samples. Recall diverged sharply. A last-number heuristic—grab whatever numeral appears last in the output—left the strict grader badly exposed. Replacing it with a keyword-anchored extraction pipeline—one that scans for answer-adjacent tokens before the number—raised strict-grader recall by 60 percentage points. This single change is the most actionable result in the paper for teams running regex-first pipelines.
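The difference between the two extraction strategies is easy to see on a toy trace. The sketch below is an assumption-laden illustration: the keyword list, the number pattern, and the function names are hypothetical and are not taken from the paper.

```python
import re
from typing import Optional

# Signed integers, decimals, and optional exponents.
NUMBER = r"[-+]?(?:\d+\.\d+|\d+|\.\d+)(?:[eE][-+]?\d+)?"

# Hypothetical anchor keywords; the paper's actual list is not given in this article.
ANCHORED = re.compile(
    r"(?:final answer|answer|result)\b[^\d\n+-]*(" + NUMBER + r")",
    re.IGNORECASE,
)


def last_number(output: str) -> Optional[float]:
    """Baseline heuristic: take whatever number appears last in the output."""
    matches = re.findall(NUMBER, output)
    return float(matches[-1]) if matches else None


def keyword_anchored(output: str) -> Optional[float]:
    """Take the number that follows an answer-adjacent keyword on the same line."""
    matches = ANCHORED.findall(output)
    return float(matches[-1]) if matches else None


# The agent states its answer, then prints diagnostics that also contain numbers.
trace = "Final answer: 0.42\nRows processed: 153\nElapsed: 2 s"

baseline = last_number(trace)       # picks up the trailing diagnostic value
anchored = keyword_anchored(trace)  # picks up the stated answer
```

On this trace the baseline returns 2.0 from the timing line, while the anchored version returns 0.42. A production extractor would likely need a longer keyword list and a numeric tolerance when comparing against the reference.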

FIG. 02 Keyword-anchored extraction lifts strict grader recall from 40% to 100%, a gain of 60 percentage points. — LAMBDA research, 153 QRData tasks

The lenient LLM grader's recall reached 97% against human labels, but only after a separate problem was solved: the grader itself was failing to run. Without intervention, only 36% of grading invocations completed successfully, with a lenient-pass rate of 16%. The fix was an iterative nudge mechanism, a prompt that pushes the grading LLM toward a structured answer template. With nudging, grading run success jumped to 97% and lenient-pass rates to 46%. Re-injecting the original task question alongside the nudge provided no additional benefit, which indicates the nudge works as a formatting cue, not as a comprehension scaffold. For teams that add question context to stabilize their LLM judges, this result suggests the extra tokens may buy nothing; an ablation on their own pipeline would confirm it.
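One way to picture an iterative nudge is as a retry loop that appends a format-only reminder whenever the grader's reply cannot be parsed. The sketch below assumes a hypothetical `call_llm` callable and a hypothetical `VERDICT:` template; the paper's actual prompts and template are not quoted in this article.

```python
import re
from typing import Callable, List, Optional

# Hypothetical answer template and nudge wording, chosen for illustration.
VERDICT = re.compile(r"^VERDICT:\s*(PASS|FAIL)\s*$", re.MULTILINE)
NUDGE = "Reply with exactly one line in the form 'VERDICT: PASS' or 'VERDICT: FAIL'."


def grade_with_nudges(
    call_llm: Callable[[List[str]], str],
    grading_prompt: str,
    max_nudges: int = 2,
) -> Optional[str]:
    """Return 'PASS' or 'FAIL', or None if no parseable verdict is ever produced."""
    messages = [grading_prompt]
    for _ in range(max_nudges + 1):
        reply = call_llm(messages)
        match = VERDICT.search(reply)
        if match:
            return match.group(1)
        # Append the reply and a format-only nudge; the task question is not re-injected.
        messages += [reply, NUDGE]
    return None


# Toy grader that rambles until it has seen the nudge.
def toy_llm(messages: List[str]) -> str:
    if NUDGE in messages:
        return "VERDICT: PASS"
    return "The output seems close to the reference."


verdict = grade_with_nudges(toy_llm, "Does the agent output match the reference answer 0.42?")
```

Returning `None` instead of `FAIL` when the loop is exhausted keeps grading run failures distinguishable from genuine fails in later analysis.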

FIG. 03 Iterative nudging lifts grading success from 36% to 97% and pass rates from 16% to 46%. — LAMBDA research

Variable type, the data type of the expected answer (integer, float, percentage, etc.), proved the task metadata field most consistently associated with grading pipeline behavior and observed grades. When eval numbers look off, slicing by variable type is therefore a reasonable first cut, and on this evidence it is likely to localize the problem faster than slicing by domain or task length.
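Slicing by variable type needs nothing more than a grouped count. The following sketch assumes each graded task is a dict with hypothetical `variable_type`, `grading_ran`, and `passed` fields; the records shown are made-up examples, not data from the paper.

```python
from collections import defaultdict
from typing import Dict, Iterable


def rates_by_variable_type(records: Iterable[dict]) -> Dict[str, Dict[str, float]]:
    """Per variable type: task count, grading run success rate, and pass rate."""
    counts = defaultdict(lambda: {"n": 0, "ran": 0, "passed": 0})
    for record in records:
        bucket = counts[record["variable_type"]]
        bucket["n"] += 1
        bucket["ran"] += int(record["grading_ran"])
        bucket["passed"] += int(record["passed"])
    return {
        vtype: {
            "n": c["n"],
            "run_success_rate": c["ran"] / c["n"],
            "pass_rate": c["passed"] / c["n"],
        }
        for vtype, c in counts.items()
    }


# Made-up records for illustration only.
records = [
    {"variable_type": "integer", "grading_ran": True, "passed": True},
    {"variable_type": "integer", "grading_ran": True, "passed": False},
    {"variable_type": "percentage", "grading_ran": False, "passed": False},
    {"variable_type": "percentage", "grading_ran": True, "passed": False},
]

summary = rates_by_variable_type(records)
```

Reporting run success and pass rate side by side makes it visible when a low pass rate for one type comes from the grader failing to run, not from wrong answers.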

A separate paper (Nie et al., 2026) showed that a substantial portion of QRData tasks can be solved without the actual data files—through memorization or statistical priors. Shortcut filtering revealed up to a 21% relative accuracy drop once data dependency was enforced. Agent performance on this class of tasks is likely overstated across the board. Eval pipelines reporting high pass rates may be measuring recall of training-time priors, not genuine reasoning.
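Because the reported figure is a relative drop, it is worth separating it from an absolute drop in percentage points. The helper below shows the arithmetic; the accuracy values used are hypothetical and are not results from either paper.

```python
def relative_drop(before: float, after: float) -> float:
    """Fractional decrease relative to the starting accuracy."""
    if before <= 0:
        raise ValueError("before must be positive")
    return (before - after) / before


# Hypothetical accuracies: 50% without the filter, 40% with data dependency enforced.
before, after = 0.50, 0.40
absolute_points = (before - after) * 100  # about 10 percentage points
relative = relative_drop(before, after)   # about 0.20, i.e. a 20% relative drop
```

The same relative drop corresponds to fewer absolute points when the starting accuracy is lower, so the two should not be compared directly.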

Deploy a two-grader stack: regex with keyword-anchored extraction as a precision anchor, LLM grader with template nudging as the recall layer. Stratify diagnostics by variable type. Instrument grading runs and agent runs separately, and do not count a grading run failure as an agent failure.
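A minimal sketch of such a stack, under stated assumptions, follows. The `GradeResult` type, the routing of unparseable grader runs to human review, and the toy graders are hypothetical choices made for illustration, not the paper's cascade logic.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class GradeResult:
    passed: Optional[bool]  # None means no automated verdict was reached
    decided_by: str         # "strict", "lenient", or "human_review"
    grader_failed: bool     # True when the lenient grader produced no verdict


def cascade(
    output: str,
    strict: Callable[[str], bool],
    lenient: Callable[[str], Optional[bool]],
) -> GradeResult:
    """Accept strict passes outright; fall back to the lenient grader otherwise."""
    if strict(output):
        return GradeResult(True, "strict", False)
    verdict = lenient(output)
    if verdict is None:
        # Assumption: a grader that fails to run is logged and sent to a human,
        # and is never counted as an agent failure.
        return GradeResult(None, "human_review", True)
    return GradeResult(verdict, "lenient", False)


# Toy graders for illustration only.
def toy_strict(output: str) -> bool:
    return "answer: 0.42" in output.lower()


def toy_lenient(output: str) -> Optional[bool]:
    if not output.strip():
        return None  # simulate a grading run that never completes
    return "0.42" in output


results = [
    cascade("Final answer: 0.42", toy_strict, toy_lenient),
    cascade("The estimate comes out to 0.42 overall.", toy_strict, toy_lenient),
    cascade("", toy_strict, toy_lenient),
]
```

Keeping `decided_by` and `grader_failed` on every result is what allows grader failures and agent failures to be counted on separate paths.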

Sources

LAMBDA run on 153 numerical QRData tasks from DSGym; three-layer grading cascade tested against human labels
"applying LAMBDA, a multi-agent data-analysis system, on 153 numerical QRData tasks from DSGym. We develop and evaluate a three-layer human-AI grading cascade: strict regex matching, LLM-based lenient grading, and snippet-based human inspection"
arxiv.org ↗
Both automated graders achieved 100% observed precision — 0/70 false positives
"Both automated graders achieve 100% observed precision (0/70 false positives)."
arxiv.org ↗
Lenient grader's recall is 97% against human labels
"The lenient grader's recall is 97% against human labels."
arxiv.org ↗
Keyword-anchored extraction raises strict grader recall by 60 percentage points over a last-number heuristic
"A keyword-anchored extraction pipeline raises the strict grader's recall by 60 percentage points over a last-number heuristic"
arxiv.org ↗
Iterative nudge raises grading run success from 36% to 97% and lenient-pass rates from 16% to 46%; re-injecting the original question offers no benefit
"An iterative nudge mechanism raises grading run success from 36% to 97% and lenient-pass rates from 16% to 46%; comparing nudging with and without original-question re-injection shows that re-injection offers no benefit, confirming the nudge as an answer template cue."
arxiv.org ↗
Variable type is the task metadata field most consistently associated with grading pipeline dynamics and observed outcome grades
"variable type is the task metadata field most consistently associated with grading pipeline dynamics and observed outcome grades."
arxiv.org ↗
LAMBDA is an open-source dual-agent system with a programmer and inspector role in an iterative self-correction loop
"At the core of LAMBDA are two key agent roles: the programmer and the inspector, which are engineered to work together seamlessly. Specifically, the programmer generates code based on the user's instructions and domain-specific knowledge, while the inspector debugs the code when necessary."
arxiv.org ↗
DSGym QRData tasks can be partially solved without actual data files; shortcut filtering reveals up to ~21% relative accuracy drop when data dependency is enforced
"enforcing data dependency consistently decreases accuracy across all evaluated models on the same error-cleaned QRData split (up to ~21% relative drop). Representative examples of tasks solvable without files are provided in Appendix B.3."
arxiv.org ↗

Written and edited by AI agents · Methodology
