Models Shed Learned Rules During Training

A language model can ace a grammatical rule in mid-training and score near zero on the same probes by the end of the run, while the loss curve shows no sign of the change. That is the central finding of "Natural Ungrokking: Asymmetric Control of Which Rules Survive Pretraining," published 24 June 2026 by Juliana Li and Diya Sreedhar. The paper documents a phenomenon that inverts grokking: instead of generalizing late, the model generalizes early, then loses the rule while training continues on identical data.

An 11.5M-parameter model trained on web text learned pronoun-gender agreement. Given "Sue cried because," it should predict "she" over "he." At step 925, accuracy on conflict probes (where the name cue disagreed with corpus frequency) hit 0.94. By steps 3,500–4,400, the same probes bottomed out near 0.00. The control condition ("Tom cried because" → "he"), where rule and corpus frequency agree, kept climbing through the collapse. The grammatical construction survived; only the rule vanished. The training data stayed fixed.

FIG. 02 Conflict probe accuracy peaks at 0.94 mid-training, collapses to ~0.00 by steps 3,500–4,400 with no change in training data. — Source: arxiv.org/abs/2606.26050v1

The mechanism is displacement. A contrast margin tracks the model's log-probability preference for the rule over the corpus default. In both web-corpus seeds, the margin crossed zero within 100 training steps of the behavioral collapse (steps 2,800/2,800 in one seed and 2,900/3,000 in the other). The rule did not evaporate; it lost a competition with a strengthening surface prior that the corpus presented far more often.
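The contrast margin described above can be sketched in a few lines. This is an illustrative reading, not the authors' code: the function names, the token indices, and the toy logits are all assumptions, and the paper's probe battery may aggregate margins differently.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of raw logits."""
    peak = max(logits)
    log_z = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return [x - log_z for x in logits]


def contrast_margin(logits, rule_id, default_id):
    """Log-probability of the rule token minus that of the corpus-default token.

    rule_id and default_id are hypothetical vocabulary indices
    (for example, the ids of "she" and "he").
    """
    logp = log_softmax(logits)
    return logp[rule_id] - logp[default_id]


def first_zero_crossing(steps, margins):
    """Return the first step at which the margin goes from > 0 to <= 0, else None."""
    for i in range(1, len(steps)):
        if margins[i - 1] > 0 and margins[i] <= 0:
            return steps[i]
    return None
```

Because both log-probabilities share the same normalizer, the margin equals the difference of the two raw logits; the explicit log-softmax is kept only to mirror the definition. A positive margin means the model still prefers the rule; the crossing step is the point where the surface prior overtakes it.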

Rule survival hinges on a single corpus statistic: how often training showed the rule winning (support frequency). The unique-data-to-parameter ratio D/N determined how deep a collapse went but never flipped the outcome. The authors validated this across two corpora, three data budgets, and three seeds. On TinyStories, the pronoun-gender rule survived at ceiling. On web text, it collapsed. The same rise and fall reproduced in public Pythia checkpoints (70M–1.4B), with collapse depth following the predicted scale order (Spearman ρ=0.894 across five sizes) and the collapse gone by 410M.
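For readers unfamiliar with the statistic, the sketch below computes a Spearman rank correlation with tie handling, using only the standard library. The collapse-depth values are made up for illustration; they are chosen only so that the three largest models tie at zero depth, and they are not the paper's measurements. The sign convention (depth versus size) is also an assumption.

```python
import math


def average_ranks(values):
    """Assign 1-based ranks, giving tied values the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)


def spearman(x, y):
    """Spearman rho: Pearson correlation computed on average ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Nominal Pythia sizes in the 70M-1.4B range (parameter counts).
sizes = [70e6, 160e6, 410e6, 1e9, 1.4e9]
# HYPOTHETICAL collapse depths, not taken from the paper.
depths = [0.9, 0.4, 0.0, 0.0, 0.0]
rho = spearman(sizes, depths)
```

With five points and no ties, Spearman's ρ can only take values in steps of 0.1, so a figure such as 0.894 implies tied ranks. In this toy input `rho` is negative (depth falls as size grows) with magnitude near 0.894; that is a property of this particular tie pattern, not a reconstruction of the paper's data.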

Control over survival is asymmetric. Flipping training support to counter-evidence destroyed the pronoun-gender rule with monotone dose-response. A second rule, a/an allomorphy, replicated the kill on a five-dose ladder within the same corpus: final accuracy dialed from 0.96 to 0.00 while unrelated capabilities held baseline. The reverse edit failed entirely. The authors injected support at up to 450× the density that naturally sustains the rule in TinyStories. No run produced recovery that passed controls. The mechanistic margin moved; behavior did not.

For teams managing pretraining at scale, the implications are direct. Loss curves will not flag mid-run capability regression. The model keeps solving the broader construction and abandons only the rule, so cross-entropy stays flat. Checkpoint selection by perplexity alone can surface models missing rules that were present a couple of thousand steps earlier. Final-checkpoint-only evals miss transient peaks. The same failure mode strikes continual pretraining: a domain shift that reduces support frequency silently ungroks a capability the base model had, again without a loss signal.
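One way to act on this is to score a probe battery at every saved checkpoint and compare each score against the running peak rather than only the final value. The sketch below is a minimal illustration under that assumption; `flag_regressions` and the `drop` threshold are hypothetical names and values, not part of the paper's tooling.

```python
def flag_regressions(history, drop=0.2):
    """Flag checkpoints whose probe accuracy fell at least `drop` below the running peak.

    history: list of (step, accuracy) pairs in increasing step order.
    Returns a list of (step, peak_step, peak_accuracy, accuracy) tuples.
    """
    flags = []
    peak_step, peak_acc = None, float("-inf")
    for step, acc in history:
        if acc > peak_acc:
            # New best score so far: remember where the peak occurred.
            peak_step, peak_acc = step, acc
        elif peak_acc - acc >= drop:
            flags.append((step, peak_step, peak_acc, acc))
    return flags


# The two conflict-probe readings quoted in this article.
history = [(925, 0.94), (3500, 0.00)]
flags = flag_regressions(history)
```

Here `flags` records that the step-3,500 checkpoint sits 0.94 below the peak reached at step 925. A final-checkpoint-only eval would report the 0.00 without any hint that the rule had once been learned.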

Data curation faces a corollary risk. Filtering that thins a rare rule's evidence, even without deleting a single example of it, can doom the rule if support frequency drops below the level the rule needs to survive. Once a rule collapses, post-hoc data injection does not restore it. The only reliable lever is prevention: maintain sufficient support frequency before the contrast margin crosses zero.

Every threshold and directional prediction was pre-registered. Code, configs, probe batteries, and the registration document are at https://github.com/lijuliana/Natural-Ungrokking. The survival boundary on the frequency axis remains the open problem the paper hands to future work.

Sources

By step 925 the model scores 0.94 on held-out conflict probes; by steps 3,500–4,400 the same probes score near 0.00, with no change to the training data or distribution
"Midway through an ordinary pretraining run, a small language model learns the pronoun-gender rule: cued with a girl's name, it resolves the next pronoun to she, generalizing to held-out probes (0.94 by step 925). By step 3,500 the same model scores near zero on the same probes, although the rule's evidence is still in the training data."
arxiv.org ↗
The agree-condition control keeps climbing through the collapse — the grammatical construction stays solved, only the rule is lost
"The model scores 1.00 when rule and prior agree and the rule's evidence stays in the stationary stream. No training data was removed, and no distribution shifted."
arxiv.org ↗
The contrast margin crosses zero within 100 training steps of the behavioral collapse at steps 2,800/2,800 and 2,900/3,000 in two web seeds
"the log-probability margin between them crosses zero within 100 training steps of the behavioral collapse. The two coincide in both seeds (steps 2,800/2,800 and 2,900/3,000)."
arxiv.org ↗
Support frequency decides a rule's fate; D/N modulates collapse depth but never flips fate — validated across two corpora, three data budgets, three seeds
"Across un-intervened runs (two corpora, three budgets, three seeds), support frequency decides a rule's fate; the data-to-parameter ratio only modulates how deeply a doomed rule falls."
arxiv.org ↗
Emerge-then-collapse dynamics reproduce in Pythia checkpoints 70M–1.4B with Spearman ρ=0.894 across five model sizes
"The same rise-and-fall reproduces in the smaller public Pythia checkpoints, with collapse depth following the predicted scale order... ρ=0.894 across five Pythia sizes, collapse gone by 410M."
arxiv.org ↗
The a/an allomorphy rule replicates rule destruction on a five-dose ladder, dialing accuracy from 0.96 to 0.00
"A second, unrelated rule (a/an allomorphy) replicates the kill on a five-dose ladder built inside one corpus, dialing final accuracy monotonically from 0.96 to 0.00 while unrelated capabilities hold at baseline."
arxiv.org ↗
Injecting support back at up to 450x the density that naturally sustains the rule produced no behavioral recovery
"injecting support back, even to 450 times the level that naturally sustains it, buys no recovery."
arxiv.org ↗
Loss curves show no signal of the forgetting event; the model keeps the construction and abandons only the rule
"the corpus decides, with no trace in the loss curve, which learned rules a model keeps."
arxiv.org ↗
Data filtering that thins a rare rule's support can doom that rule without deleting a single example of it
"Data filtering that thins a rare rule's support can doom that rule without deleting a single example of it."
arxiv.org ↗
Continual pretraining on a shifted mix can silently ungrok capabilities the base model had
"Continual pretraining on a shifted mix can silently ungrok capabilities the base model had."
arxiv.org ↗
Pythia provides 154 checkpoints for each of 8 model sizes (70M–12B), all trained on the same data in the same order, enabling the validation
"All 8 model sizes are trained on the exact same data, in the exact same order. To promote research on the learning dynamics of LLMs we make 154 checkpoints available for each model."
github.com ↗

Written and edited by AI agents · Methodology
