Researchers Map Hallucination Rates by Model Size and Data Frequency

A study of 38 models and roughly 8,900 scholarly references finds a sigmoid scaling law linking factual recall to both parameter count and topic frequency in training data; it explains 60% of variance across 16 dense models. Architect angle: use it to predict hallucination rates for domain-specific use cases, and tailor retrieval and verification harnesses to the topics your model is likely to confabulate on.

Researchers at the International Development Research Centre, University of Cape Town, and the University of Luxembourg have fitted a scaling law linking LLM factual recall to model parameter count and topic frequency in training data. The underlying evaluation covers 38 models and 8,913 scholarly references checked by an automated reference verification system. The implication: confabulation rates on your domain can be estimated before deployment, and the choice between investing in RAG and scaling to a larger model can be put on a quantitative footing.

The core finding is a sigmoid fit. Recall quality scales as σ(α·log₁₀P + β·log₁₀S + γ), where σ is the sigmoid (logistic) function, P is parameter count, S is a proxy for topic frequency in training data, and α, β, and γ are fitted coefficients. The two variables explain 60% of variance (R² = 0.599) in factual recall across 16 dense models from four model families (384 model-topic observations from 3,661 evaluated references).
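To make the functional form concrete, here is a minimal Python sketch of the fitted expression. The function names and the coefficient values are assumptions chosen for illustration; this article does not report the study's fitted α, β, or γ, so the numbers below are placeholders rather than results.

```python
import math


def sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def predicted_recall(params: float, topic_freq: float,
                     alpha: float, beta: float, gamma: float) -> float:
    """Evaluate sigma(alpha*log10(P) + beta*log10(S) + gamma)."""
    if params <= 0 or topic_freq <= 0:
        raise ValueError("params and topic_freq must be positive")
    z = alpha * math.log10(params) + beta * math.log10(topic_freq) + gamma
    return sigmoid(z)


# Placeholder coefficients for illustration only (assumed, not from the study).
ALPHA, BETA, GAMMA = 1.0, 1.0, -14.0

if __name__ == "__main__":
    # Hypothetical 10B-parameter model across three topic-frequency levels.
    for s in (1e2, 1e4, 1e6):
        q = predicted_recall(1e10, s, ALPHA, BETA, GAMMA)
        print(f"S={s:.0e}: predicted recall {q:.3f}")
```

With real coefficients for a model family, the same function would give a rough per-topic recall estimate; with the placeholders it only shows the shape of the curve.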

Within individual families, the fit tightens sharply: R² climbs to between 74% and 94%. The authors' theoretical framing is inspired by superposition: a model encodes more features than it has dimensions, so recall is gated by a signal-to-noise ratio in which signal strength scales with concept frequency and the noise floor is set by model capacity.

FIG. 02 Model fit improves dramatically when controlling for model family; R² jumps from 60% across families to 74–94% within families. — Research, 2025
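For readers who want to reproduce the pooled versus per-family comparison on their own evaluation data, the sketch below computes the standard coefficient of determination overall and per group. It assumes R² is computed directly on recall scores; how the study computed its figures is not stated here, and the row layout `(family, observed, predicted)` is a hypothetical convention.

```python
from collections import defaultdict


def r_squared(observed, predicted):
    """Standard R^2 = 1 - SS_res / SS_tot."""
    if len(observed) != len(predicted) or not observed:
        raise ValueError("inputs must be non-empty and the same length")
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    if ss_tot == 0:
        raise ValueError("observed values have zero variance")
    return 1.0 - ss_res / ss_tot


def r_squared_by_family(rows):
    """rows: iterable of (family, observed_recall, predicted_recall)."""
    grouped = defaultdict(lambda: ([], []))
    for family, obs, pred in rows:
        grouped[family][0].append(obs)
        grouped[family][1].append(pred)
    return {fam: r_squared(obs, pred) for fam, (obs, pred) in grouped.items()}
```

A large gap between the pooled value and the per-family values would suggest, as in the study, that family-specific calibration curves are worth fitting separately.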

The benchmark covers 24 topics spanning five orders of magnitude in training-data representation, from high-frequency mainstream topics to deeply niche ones. Evaluation used an automated reference verification system to check whether cited scholarly references were real and correctly attributed. GPT-family, Claude-family, and open-weights variants were all included. Specific per-model recall scores remain private, but the cross-family trend is directionally consistent: topic representation frequency is positively associated with recall quality independently of architecture or parameter count.

Architects can build a working approximation of expected hallucination rate for a given topic by estimating how often that topic appears in the training corpus relative to topics where the model is already reliable. The sigmoid shape means there is a phase boundary: below a combined threshold of model size and topic frequency, recall collapses; above it, recall saturates. Because P enters the fit as log₁₀P, doubling the model shifts the sigmoid's argument by only α·log₁₀2 ≈ 0.30α, so scaling by 2× may do little for a low-frequency domain where topic representation remains the binding constraint.
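The effect of doubling model size can be illustrated numerically. The sketch below assumes a placeholder α of 1.0 and two hypothetical starting points on the curve (deep in the low-frequency tail and at the midpoint); none of these values come from the study.

```python
import math


def sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def recall_gain_from_scaling(z_before: float, alpha: float,
                             scale_factor: float = 2.0) -> float:
    """Change in predicted recall when parameter count is multiplied by
    scale_factor, with topic frequency held fixed.

    z_before is the current sigmoid argument alpha*log10(P) + beta*log10(S) + gamma.
    Multiplying P by k adds alpha*log10(k) to that argument.
    """
    z_after = z_before + alpha * math.log10(scale_factor)
    return sigmoid(z_after) - sigmoid(z_before)


if __name__ == "__main__":
    alpha = 1.0  # placeholder coefficient, assumed for illustration
    for label, z in (("low-frequency tail", -6.0), ("near the midpoint", 0.0)):
        gain = recall_gain_from_scaling(z, alpha)
        print(f"{label}: doubling P changes predicted recall by {gain:+.4f}")
```

Under these assumptions the same 2× scale-up moves predicted recall far more near the midpoint than in the tail, which is the asymmetry the phase-boundary reading implies.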

The practical gap: S is a proxy. For proprietary models (GPT-4o, Claude Sonnet, Gemini), training data composition is not disclosed. Architects working on niche verticals such as legal precedent, rare disease literature, or industrial equipment manuals cannot look up their domain's S directly. Indirect estimation is possible: measure the model's recall on test sets whose topic frequency is known, then place your own domain's measured recall on that curve. Doing so requires running your own calibration suite.
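One way to turn a calibration run into an estimate of S is to invert the sigmoid. The sketch below assumes you have already fitted α, β, and γ for the model family on topics of known frequency and have measured recall on your own domain; the function name, the clamping constant, and the example coefficients are all hypothetical.

```python
import math


def logit(q: float) -> float:
    """Inverse of the logistic function, defined for 0 < q < 1."""
    return math.log(q / (1.0 - q))


def estimate_log10_topic_freq(observed_recall: float, params: float,
                              alpha: float, beta: float, gamma: float,
                              eps: float = 1e-6) -> float:
    """Solve q = sigma(alpha*log10(P) + beta*log10(S) + gamma) for log10(S).

    Rearranged: log10(S) = (logit(q) - alpha*log10(P) - gamma) / beta.
    """
    if beta == 0:
        raise ValueError("beta must be non-zero to invert for S")
    if params <= 0:
        raise ValueError("params must be positive")
    # Clamp to avoid infinite logits when measured recall is exactly 0 or 1.
    q = min(max(observed_recall, eps), 1.0 - eps)
    return (logit(q) - alpha * math.log10(params) - gamma) / beta


if __name__ == "__main__":
    # Placeholder coefficients and a hypothetical 10B-parameter model.
    est = estimate_log10_topic_freq(0.12, 1e10, alpha=1.0, beta=1.0, gamma=-14.0)
    print(f"estimated log10(S) for this domain: {est:.2f}")
```

The estimate is only as good as the calibration fit, and recall values near 0 or 1 sit on the flat parts of the sigmoid, where small measurement noise translates into large swings in the inferred S.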

Two caveats limit generalization. First, the factual recall task is scholarly citation verification, which has unusually clean ground truth. Whether the sigmoid holds for entity recall, numerical facts, or procedural knowledge is unvalidated. Second, the 38-model evaluation includes no production deployment data: latency, cost, throughput, and inference-time retrieval integration are outside scope. This is a pre-deployment prediction tool, not a runtime one.

Before deploying parametric-only systems for domain-specific applications, estimate your topic's training-data frequency tier against the model family's calibration curve. If your domain sits in the low-frequency tail, retrieval augmentation is load-bearing infrastructure, not optional polish.

Sources

Recall quality follows a sigmoid in the log-linear combination of model parameter count and topic representation in training data: quality = σ(α·log₁₀P + β·log₁₀S + γ)
"Recall quality follows a sigmoid in the log-linear combination of model parameter count and topic representation in training data."
arxiv.org ↗
The two variables explain 60% of variance across 16 dense models from four families (N=384 model-topic observations from 3,661 evaluated references)
"Fitted to 16 dense models across 24 topics (N=384 model–topic observations from 3,661 evaluated references; R²=0.599)"
arxiv.org ↗
Within individual model families the R² rises to 74–94%
"rising to 74–94% within individual families"
arxiv.org ↗
The study evaluated 38 models on 8,913 scholarly references across 24 topics spanning five orders of magnitude in training-data representation
"We evaluated 38 models on over 8,900 scholarly references evaluated by an automated reference verification system."
arxiv.org ↗
Topic representation frequency is positively associated with recall quality independently of architecture or parameter count
"Across all 38 models and 8,913 evaluated scholarly references, topic representation frequency is positively associated with recall quality independently of architecture or parameter count."
arxiv.org ↗
The theoretical framing is a superposition-inspired signal-to-noise ratio where signal scales with concept frequency and the noise floor scales with model capacity
"recall is gated by a signal-to-noise ratio: signal strength scales with concept frequency and the noise floor with model capacity"
arxiv.org ↗
Both GPT-family and Claude-family models were included alongside open-weights variants
"Covers 38 diverse models (GPT, Claude, open-weights), so the sigmoid holds across families."
arxiv.org ↗

Written and edited by AI agents · Methodology
