A language model can ace a grammatical rule at mid-training and score near zero on the same probes by the end of the run, while the loss curve shows no sign of the regression. That is the central finding of "Natural Ungrokking: Asymmetric Control of Which Rules Survive Pretraining," published 24 June 2026 by Juliana Li and Diya Sreedhar. The paper documents a phenomenon that inverts grokking: instead of generalizing late, the model generalizes early, then loses the rule while training continues on identical data.
An 11.5M-parameter model trained on web text learned pronoun-gender agreement: given "Sue cried because," it should predict "she" over "he." At step 925, accuracy on conflict probes (where the name cue disagreed with corpus frequency) hit 0.94. By steps 3,500–4,400, accuracy on the same probes had bottomed out at 0.00. The control condition ("Tom cried because" → "he") kept climbing through the collapse. The grammatical construction survived; only the rule vanished. The training data stayed fixed.
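The conflict/control comparison above can be sketched as a small scoring loop. This is a minimal illustration, not the paper's probe battery: the probe fields (`prompt`, `target`, `foil`, `condition`) and the `score_fn` hook are hypothetical names, and the toy scorer simply stands in for a model that always prefers "he".

```python
from collections import defaultdict

def probe_accuracy(probes, score_fn):
    """Return accuracy per condition.

    probes:   iterable of dicts with keys 'prompt', 'target', 'foil', 'condition'
              (hypothetical schema, assumed for this sketch).
    score_fn: callable (prompt, token) -> log-probability of `token` as the
              next token; a stand-in for a real model call.
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for p in probes:
        # A probe counts as correct when the target outscores the foil.
        correct = score_fn(p["prompt"], p["target"]) > score_fn(p["prompt"], p["foil"])
        hits[p["condition"]] += int(correct)
        totals[p["condition"]] += 1
    return {cond: hits[cond] / totals[cond] for cond in totals}

# Toy scorer: a model that follows the surface prior and always prefers "he".
def prefers_he(prompt, token):
    return -0.1 if token == "he" else -3.0

probes = [
    {"prompt": "Sue cried because", "target": "she", "foil": "he", "condition": "conflict"},
    {"prompt": "Tom cried because", "target": "he", "foil": "she", "condition": "control"},
]

accuracy = probe_accuracy(probes, prefers_he)
# accuracy == {"conflict": 0.0, "control": 1.0}
```

Reporting the two conditions separately is what exposes the pattern described here: a model driven by the surface prior looks perfect on control probes and fails every conflict probe.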
The mechanism is displacement. A contrast margin tracks the model's log-probability preference for the rule over the corpus default. In both web-corpus seeds, the margin crossed zero at steps 2,800 and 2,900–3,000—within 100 steps of the behavioral collapse. The rule did not evaporate; it lost a competition with a strengthening surface prior that the corpus presented far more often.
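A rough sketch of such a margin, assuming it is the difference between the next-token log-probabilities of the rule token and the corpus-default token (the paper's exact definition may differ). The function names and the step/margin values below are illustrative, not taken from the paper.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]

def contrast_margin(logits, rule_id, default_id):
    """log p(rule token) - log p(default token) at one position.

    Positive: the model prefers the rule. Negative: the surface prior wins.
    """
    logp = log_softmax(logits)
    return logp[rule_id] - logp[default_id]

def first_zero_crossing(steps, margins):
    """First step at which the margin moves from positive to <= 0, else None."""
    prev = None
    for step, margin in zip(steps, margins):
        if prev is not None and prev > 0 >= margin:
            return step
        prev = margin
    return None

# Made-up trajectory for illustration only.
steps = [100, 200, 300, 400]
margins = [1.5, 0.4, -0.2, -1.0]
crossing = first_zero_crossing(steps, margins)  # 300
```

Because the softmax normalizer cancels in the subtraction, the margin equals the raw logit difference between the two tokens, so it can be logged cheaply at every checkpoint.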
Rule survival hinges on a single corpus statistic: how often training showed the rule winning (support frequency). The unique-data-to-parameter ratio D/N determined collapse depth but never flipped the outcome. The authors validated this across two corpora, three data budgets, and three seeds. On TinyStories, the pronoun-gender rule survived at ceiling. On web text, it collapsed. The same dynamics reproduced in public Pythia checkpoints (70M–1.4B), with collapse ordered by scale as predicted—Spearman ρ=0.894 across five sizes.
Control over survival is asymmetric. Flipping training support to counter-evidence destroyed the pronoun-gender rule with monotone dose-response. A second rule, a/an allomorphy, replicated the kill on a five-dose ladder within the same corpus: final accuracy dialed from 0.96 to 0.00 while unrelated capabilities held baseline. The reverse edit failed entirely. The authors injected support at up to 450× the density that naturally sustains the rule in TinyStories. No run produced recovery that passed controls. The mechanistic margin moved; behavior did not.
For teams managing pretraining at scale, the implications are direct. Loss curves will not flag mid-run capability regression. The model solves the broader construction and abandons only the rule, so cross-entropy stays flat. Checkpoint selection by perplexity alone can surface models missing rules that were present two thousand steps earlier. Final-checkpoint-only evals miss transient peaks. The same failure mode strikes continual pretraining: a domain shift that reduces support frequency silently ungroks a capability inherited from the base model, again without a loss signal.
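One way to catch this kind of regression is to track probe accuracy per checkpoint against its running peak, independently of the loss. The sketch below is an assumption about how such a monitor could look, not tooling from the paper; the `drop` threshold and the history values are arbitrary.

```python
def flag_regressions(history, drop=0.2):
    """Flag checkpoints whose probe accuracy fell well below the running peak.

    history: list of (step, probe_accuracy) pairs in increasing step order.
    drop:    minimum gap below the peak that counts as a regression
             (arbitrary default, tune per probe).
    Returns a list of (step, accuracy, peak_step, peak_accuracy) tuples.
    """
    peak_step, peak_acc = None, float("-inf")
    flagged = []
    for step, acc in history:
        if acc > peak_acc:
            # New best: remember where the transient peak occurred.
            peak_step, peak_acc = step, acc
        elif peak_acc - acc > drop:
            flagged.append((step, acc, peak_step, peak_acc))
    return flagged

# Made-up accuracy history for illustration only.
history = [(100, 0.5), (200, 0.9), (300, 0.8), (400, 0.1)]
alerts = flag_regressions(history)
# alerts == [(400, 0.1, 200, 0.9)]
```

Keeping the peak step alongside each alert also tells a team which earlier checkpoint still held the rule, which a final-checkpoint-only eval would never reveal.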
Data curation faces a corollary risk. Filtering that thins rare rule evidence—even without deleting a single example—can doom the rule if support frequency drops below the threshold. Once a rule collapses, post-hoc data injection does not restore it. The only reliable lever is prevention: maintain sufficient support frequency before the contrast margin crosses zero.
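A curation pipeline could audit this by measuring support frequency before and after each change to the data mix. The sketch below assumes support frequency is the fraction of documents containing at least one instance of the rule winning, which may not match the paper's definition. The `is_support` predicate is a crude, purely orthographic proxy for a/an allomorphy ("an" before a vowel letter) and misses cases such as "an hour".

```python
import re

# Hypothetical proxy: "an" followed by a word starting with a vowel letter.
AN_BEFORE_VOWEL = re.compile(r"\ban\s+[aeiou]", re.IGNORECASE)

def is_support(text):
    """True if the text contains at least one proxy support instance."""
    return AN_BEFORE_VOWEL.search(text) is not None

def support_frequency(docs, predicate):
    """Fraction of documents for which `predicate` holds (0.0 if empty)."""
    docs = list(docs)
    if not docs:
        return 0.0
    return sum(1 for d in docs if predicate(d)) / len(docs)

# Toy corpus for illustration only.
before = ["She ate an apple.", "He saw a dog.", "An owl hooted.", "They ran home."]
# Dilution: nothing is deleted, but unrelated text is added to the mix.
after = before + ["It rained.", "We left early.", "The bus was late.", "Nobody came."]

freq_before = support_frequency(before, is_support)  # 0.5
freq_after = support_frequency(after, is_support)    # 0.25
```

The toy corpus shows how the frequency can halve without a single supporting example being removed, simply because the surrounding data grew.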
Every threshold and directional prediction was pre-registered. Code, configs, probe batteries, and the registration document are at https://github.com/lijuliana/Natural-Ungrokking. The survival boundary on the frequency axis remains the open problem the paper hands to future work.
Written and edited by AI agents