Large language models' first-answer accuracy on procedural tasks collapses from 61% on 5-step algorithms to 20% on 95-step algorithms, directly contradicting the assumption that strong benchmark scores signal reliable execution in production workflows.
The paper "When LLMs Stop Following Steps: A Diagnostic Study of Procedural Execution in Language Models" by Sailesh Panda, Pritam Kadasi, Abhishek Upperwal, and Mayank Singh evaluates 14 models across 55 datasets. The benchmark is simple: models receive a step-wise arithmetic algorithm and two numeric inputs and must return the final computed value. Complexity scales through algorithm length and look-back dependencies on intermediate variables — the structural challenge that enterprise workflows impose.
The benchmark design exposes a gap that standard reasoning evaluations obscure. Final-answer accuracy metrics dominate leaderboard culture but say nothing about whether a model faithfully executed each step. A model can arrive at a correct answer via shortcut heuristics or lucky cancellations and still fail catastrophically when those shortcuts are unavailable in longer, dependency-heavy traces. Generation-level analysis reveals the failure modes: premature answers, missing answers, self-correction after an initial error, under-executed traces, and hallucinated steps beyond what the algorithm specifies.
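In code, such an analysis amounts to comparing the model's emitted trace against ground-truth step values and bucketing the mismatch. The labels below follow the paper's taxonomy, but the detection heuristics and the function signature are assumptions for illustration:

```python
# Sketch of a generation-level trace classifier. model_steps holds the
# intermediate values the model wrote out; gold_steps holds the values a
# faithful execution would produce. Heuristics here are illustrative.
def classify_trace(model_steps: list[int], gold_steps: list[int],
                   model_answer: int | None, gold_answer: int) -> str:
    if model_answer is None:
        return "missing_answer"
    if len(model_steps) < len(gold_steps):
        # answered before executing every required step: a shortcut that
        # happened to land, or a trace that simply stopped short
        return "premature_answer" if model_answer == gold_answer else "under_executed"
    if len(model_steps) > len(gold_steps):
        return "hallucinated_steps"  # invented work beyond the algorithm
    first_error = next((i for i, (m, g) in enumerate(zip(model_steps, gold_steps))
                        if m != g), None)
    if first_error is not None and model_answer == gold_answer:
        return "self_corrected"      # recovered after an intermediate slip
    return "faithful" if model_answer == gold_answer else "wrong_answer"
```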
For enterprise architects, the implications are direct. Any deployment that uses an LLM to execute a deterministic multi-step process, whether an ETL pipeline, a compliance checklist, a DevOps runbook, or a financial reconciliation workflow, relies on an agent whose reliability degrades as procedure length grows. At 20% first-answer accuracy on 95-step tasks across all 14 tested models, this is not a single-model edge case. It is a class-wide failure mode, and the drop from 61% to 20% unfolds over just 90 additional steps.
Practical exposure differs by use case. Short, bounded workflows of five to ten steps sit in a zone where accuracy remains above 60%. But orchestration layers that compose tool calls, conditional branches, and iterative loops quickly push effective procedure length into ranges where failure probability dominates. RAG pipelines with multi-hop retrieval logic, agentic code-generation loops, and automated incident-response playbooks are all candidates for degraded execution reliability.
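One way to see why length dominates is a back-of-envelope model: if each step succeeded independently with probability p (a deliberate simplification, not a claim the paper makes), whole-procedure reliability would decay as p raised to the step count:

```python
# Compounding illustration under an assumed independent per-step
# success probability; real failure modes are not independent.
def procedure_success(p: float, n_steps: int) -> float:
    return p ** n_steps

for n in (5, 10, 25, 50, 95):
    print(f"{n:>3} steps @ 99% per-step reliability: "
          f"{procedure_success(0.99, n):.0%}")
# 5 steps -> 95%, 50 steps -> 61%, 95 steps -> 38%:
# even near-perfect steps compound into unreliable procedures.
```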
Whether prompting strategies — chain-of-thought, scratchpad enforcement, explicit step labeling — can recover lost accuracy at scale remains open. The authors attribute failure partly to "apparent reasoning ability masking substantial weaknesses in faithful instruction execution," pointing to training data and objective functions rather than prompt engineering as the more durable fix.
For teams evaluating LLM infrastructure, the minimal action is operational: add procedural execution tests that match the step count and dependency structure of your actual workflows before approving any agentic deployment. Benchmark scores on MMLU or GSM8K are the wrong signal for this problem.
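A starting point might look like the sketch below: generate synthetic procedures at the length and look-back depth of your real workflow, then score first-answer accuracy against a deterministic executor. The generator, the bounded-value rule, and `query_model` (a stand-in for whatever inference call your stack uses) are all assumptions for illustration:

```python
# Minimal pre-deployment harness in the spirit of the recommendation
# above. Step format matches the earlier sketch; everything else here
# (generator, modulus, trial count) is an illustrative design choice.
import random
from operator import add, sub, mul

OPS = {"add": add, "sub": sub, "mul": mul}
MOD = 10**6  # bound intermediates so 95-step traces stay tractable;
             # your prompt must state the same reduction rule

def make_procedure(n_steps: int, max_lookback: int, rng: random.Random):
    """Random step list whose operands look back at most max_lookback steps."""
    steps = []
    for i in range(1, n_steps + 1):
        pool = ["x", "y"] + [f"v{j}" for j in range(max(1, i - max_lookback), i)]
        steps.append((f"v{i}", rng.choice(list(OPS)),
                      rng.choice(pool), rng.choice(pool)))
    return steps

def execute(steps, x: int, y: int) -> int:
    """Ground-truth executor with bounded intermediate values."""
    env = {"x": x, "y": y}
    for out, op, a, b in steps:
        env[out] = OPS[op](env[a], env[b]) % MOD
    return env[steps[-1][0]]

def first_answer_accuracy(query_model, n_steps: int, max_lookback: int,
                          trials: int = 50) -> float:
    """Fraction of trials where the model's first answer matches ground truth."""
    rng = random.Random(42)
    hits = 0
    for _ in range(trials):
        steps = make_procedure(n_steps, max_lookback, rng)
        x, y = rng.randint(2, 9), rng.randint(2, 9)
        hits += query_model(steps, x, y) == execute(steps, x, y)
    return hits / trials
```

Sweeping `n_steps` and `max_lookback` to match your actual workflows, rather than testing at a single length, is what reveals where a given model's reliability cliff sits.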