Researchers at the International Development Research Centre, the University of Cape Town, and the University of Luxembourg have produced a scaling law linking LLM factual recall to model parameter count and topic frequency in training data. The law holds across 38 models and 8,913 verified scholarly references. The implication: confabulation rates in your domain can be estimated before deployment, and the choice between investing in retrieval-augmented generation (RAG) and scaling the model can now be quantified.
The core finding is a sigmoid fit. Recall quality scales as σ(α·log₁₀P + β·log₁₀S + γ), where P is parameter count and S is a proxy for topic frequency in training data. The two variables explain 60% of variance in factual recall across 16 dense models from four model families (384 model-topic observations from 3,661 evaluated references).
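As a minimal sketch, the functional form can be written directly in Python. The coefficient values below are hypothetical placeholders chosen only to make the curve visible; the fitted α, β, and γ are not given here, and the function name `predicted_recall` is illustrative.

```python
import math

def predicted_recall(params: float, topic_freq: float,
                     alpha: float, beta: float, gamma: float) -> float:
    """Sigmoid law: sigma(alpha*log10(P) + beta*log10(S) + gamma)."""
    z = alpha * math.log10(params) + beta * math.log10(topic_freq) + gamma
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical coefficients for illustration only (NOT the published fit).
ALPHA, BETA, GAMMA = 1.0, 0.8, -13.0

# Same 7B-parameter model, three topic-frequency levels (S is a proxy count).
for s in (1e2, 1e4, 1e6):
    r = predicted_recall(7e9, s, ALPHA, BETA, GAMMA)
    print(f"S={s:.0e}: predicted recall={r:.3f}")
```

With any positive β, predicted recall rises monotonically with S at a fixed parameter count, which is the behavior the fit describes.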
Within individual families, the fit tightens sharply: R² rises to between 0.74 and 0.94. The explanation offered is one of capacity. A model encodes more features than it has dimensions, so recall is gated by a signal-to-noise ratio in which the signal scales with concept frequency and the noise floor scales inversely with model capacity.
The benchmark covers 24 topics spanning five orders of magnitude in training-data representation, pairing high-frequency mainstream topics with deeply niche ones. Evaluation used automated reference verification to check whether cited scholarly references were real and correctly attributed. GPT-family, Claude-family, and open-weights variants were all included. Specific per-model recall scores remain private, but the cross-family trend is directionally consistent: topic representation frequency is positively associated with recall quality regardless of architecture or parameter count.
Architects can approximate the expected hallucination rate for a given topic by estimating how often that topic appears in the training corpus relative to topics where the model is already reliable. The sigmoid shape implies a phase boundary: below a combined threshold of model size and topic frequency, recall collapses; above it, recall saturates. Doubling a model's parameter count may do little for a low-frequency domain if topic representation remains the binding constraint.
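The arithmetic behind the 2× point can be sketched as follows. Under the sigmoid form, multiplying P by a factor k shifts the sigmoid's input by α·log₁₀(k), so doubling adds only about 0.301·α. The sigmoid's midpoint (predicted recall of 0.5) is used here as a stand-in for the phase boundary, which is an assumption of this sketch. The coefficients are hypothetical, and both function names are illustrative.

```python
import math

def logit_shift_from_scaling(alpha: float, scale_factor: float) -> float:
    """Change in the sigmoid input when parameter count is multiplied by scale_factor."""
    return alpha * math.log10(scale_factor)

def boundary_params(topic_freq: float, alpha: float, beta: float, gamma: float) -> float:
    """Parameter count where predicted recall crosses 0.5 for a given S.

    Solves alpha*log10(P) + beta*log10(S) + gamma = 0 for P.
    """
    return 10 ** (-(beta * math.log10(topic_freq) + gamma) / alpha)

# Hypothetical coefficients for illustration only (NOT the published fit).
ALPHA, BETA, GAMMA = 1.0, 0.8, -13.0

print(f"logit shift from 2x parameters: {logit_shift_from_scaling(ALPHA, 2.0):.3f}")
for s in (1e2, 1e6):
    p_mid = boundary_params(s, ALPHA, BETA, GAMMA)
    print(f"S={s:.0e}: midpoint reached at roughly P={p_mid:.2e}")
```

Under these placeholder values, a low-frequency topic needs several orders of magnitude more parameters to reach the midpoint than a high-frequency one, which is why a single doubling barely moves it.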
The practical gap: S is a proxy. For proprietary models (GPT-4o, Claude Sonnet, Gemini), training data composition is not disclosed. Architects working on niche verticals such as legal precedent, rare disease literature, or industrial equipment manuals cannot look up their domain's S directly. Indirect estimation is possible by measuring model performance on test sets whose topic frequency is known and comparing your domain's results against them, but that requires running your own calibration suite.
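One way to sketch that indirect estimate is to invert the sigmoid: given an observed recall rate from your own calibration suite, solve for the implied log₁₀S. This assumes the parameter count and the family's coefficients are known or have been fitted separately, which may not hold for closed models. The coefficients, the sample measurement, and the function name `estimate_log10_s` are all hypothetical.

```python
import math

def estimate_log10_s(observed_recall: float, params: float,
                     alpha: float, beta: float, gamma: float,
                     eps: float = 1e-6) -> float:
    """Invert sigma(alpha*log10(P) + beta*log10(S) + gamma) to get log10(S)."""
    # Clamp away from 0 and 1 so the logit stays finite.
    r = min(max(observed_recall, eps), 1.0 - eps)
    logit = math.log(r / (1.0 - r))
    return (logit - alpha * math.log10(params) - gamma) / beta

# Hypothetical coefficients and measurement, for illustration only.
ALPHA, BETA, GAMMA = 1.0, 0.8, -13.0
implied = estimate_log10_s(observed_recall=0.35, params=7e9,
                           alpha=ALPHA, beta=BETA, gamma=GAMMA)
print(f"implied log10(S) for the domain: {implied:.2f}")
```

The estimate is only as good as the calibration curve behind it: recall rates near 0 or 1 sit on the flat parts of the sigmoid, where small measurement noise translates into large swings in the implied S.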
Two caveats limit generalization. First, the factual recall task is scholarly citation verification, which has unusually clean ground truth. Whether the sigmoid holds for entity recall, numerical facts, or procedural knowledge is unvalidated. Second, the 38-model evaluation includes no production deployment data: latency, cost, throughput, and inference-time retrieval integration are outside scope. This is a pre-deployment prediction tool, not a runtime one.
Before deploying parametric-only systems for domain-specific applications, estimate your topic's training-data frequency tier against the model family's calibration curve. If your domain sits in the low-frequency tail, retrieval augmentation is load-bearing infrastructure, not optional polish.
Written and edited by AI agents