Alec Radford, Nick Levine, and David Duvenaud have released talkie-1930, a family of 13B-parameter language models trained on 260 billion tokens of pre-1931 English text — all out-of-copyright — under an Apache 2.0 license. The release includes a base model (talkie-1930-13b-base, 53.1 GB), an instruction-tuned variant (talkie-1930-13b-it, 26.6 GB), and a control model trained on FineWeb with identical architecture and training FLOPs (talkie-web-13b-base) for controlled comparisons between vintage and modern corpora.
Pre-training the base model required 260B tokens of curated historical English. The instruction-tuned checkpoint was post-trained on a dataset extracted from pre-1931 reference works (etiquette manuals, letter-writing guides, encyclopedias, cookbooks, and poetry collections), then refined with online direct preference optimization, using Claude Sonnet 4.6 as the reward judge. A final supervised fine-tuning round used rejection-sampled multi-turn synthetic dialogues generated between Claude Opus 4.6 and talkie itself. The team acknowledges the contamination this introduces: "reinforcement learning with AI feedback inevitably shapes talkie's behavior anachronistically," the report notes, citing the 7B talkie variant emerging from RL "speaking in listicles" as evidence.
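The last stage is easier to picture with a sketch. The snippet below is a hypothetical illustration of the rejection-sampling step, not the authors' pipeline: generate several candidate replies per prompt, let a judge pick one, and keep only the winners as supervised fine-tuning data. The generation and judging functions are stand-ins.

```python
# Hypothetical rejection-sampling sketch; function bodies are placeholders,
# not the authors' code. Only judge-preferred completions enter the SFT set.
import random

def sample_candidates(prompt: str, k: int = 4) -> list[str]:
    # Stand-in for talkie generating k candidate completions of the prompt.
    return [f"{prompt} [candidate {i}]" for i in range(k)]

def judge_best(prompt: str, candidates: list[str]) -> str:
    # Stand-in for the external judge (Claude, in the team's setup)
    # scoring each candidate and returning the preferred one.
    return random.choice(candidates)

def build_sft_set(prompts: list[str], k: int = 4) -> list[dict]:
    # Rejection sampling: keep one judge-preferred completion per prompt.
    return [
        {"prompt": p, "completion": judge_best(p, sample_candidates(p, k))}
        for p in prompts
    ]

if __name__ == "__main__":
    print(build_sft_set(["Pray, how does one address a duchess in a letter?"]))
```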
The plan to eliminate that contamination: bootstrap era-appropriate judges from the vintage base models — replacing Claude with a 1930-era model in a closed loop. That requires sufficient scale to make the vintage model a credible judge, which the team treats as an open research problem.
For enterprise teams navigating training-data IP liability, the data provenance is clean. The U.S. copyright cutoff is January 1, 1931, and every token in the training corpus predates it. Radford and co-authors note that subject-matter distribution, not just temporal coverage, differs between the vintage and FineWeb corpora, so behavioral differences cannot be attributed to the date cutoff alone. The talkie-web-13b-base control model exists to isolate that variable.
The research agenda distinguishes talkie from a novelty project. The team uses talkie to probe three questions: first, how well a period-bounded model can assign probability to future historical events ("the surprisingness of short descriptions of historical events to a 13B model trained on pre-1931 text"); second, whether such a model can independently re-derive post-cutoff science, an open question Demis Hassabis has framed as whether a model trained through 1911 could rediscover General Relativity as Einstein did in 1915; and third, whether few-shot prompting, with worked examples supplied in the context window, can teach a pre-modern model to write correct Python programs.
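The first of those measurements reduces to scoring text under the model. A minimal sketch, assuming the base checkpoint loads with Hugging Face transformers under the repo id talkie-lm/talkie-1930-13b-base (the exact id is a guess from the org name), computes the mean per-token negative log-likelihood of a short event description; higher values mean the event was more surprising to the 1930-era model.

```python
# Sketch of the "surprisingness" measurement: mean per-token negative
# log-likelihood of a short event description under the base model.
# The repo id is an assumption built from the talkie-lm org name.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "talkie-lm/talkie-1930-13b-base"  # assumed repo id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16
).to("cuda")
model.eval()

def surprisal(text: str) -> float:
    # Mean negative log-likelihood in nats per token; higher values mean the
    # description was more surprising to the 1930-era model.
    ids = tokenizer(text, return_tensors="pt").input_ids.to("cuda")
    with torch.no_grad():
        # Passing labels=ids makes the model return the mean cross-entropy loss.
        loss = model(input_ids=ids, labels=ids).loss
    return loss.item()

print(surprisal("The first nuclear chain reaction was achieved in Chicago in 1942."))
```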
Running talkie requires a CUDA GPU with at least 28 GB VRAM for bfloat16 inference and between 26 and 50 GB of disk per model checkpoint. The Python API and CLI install via a single GitHub clone and uv sync. Both the base and instruct models are available on Hugging Face under the talkie-lm organization; the training corpus has not yet been released, though the authors have flagged it as a future possibility given its public-domain status.
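Those numbers follow from the parameter count: 13 billion parameters at two bytes each in bfloat16 is roughly 26 GB of weights, which is why 28 GB of VRAM is the stated floor once activations and the KV cache are added. A minimal generation sketch using Hugging Face transformers directly (the repo id is assumed from the org name, and the project's own Python API or CLI may expose a different interface):

```python
# Minimal bfloat16 inference sketch via Hugging Face transformers.
# The repo id is an assumption; the project's own Python API/CLI may differ.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "talkie-lm/talkie-1930-13b-it"  # assumed repo id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16  # ~26 GB of weights for 13B params
).to("cuda")

prompt = "Kindly explain how one ought to preserve peaches for the winter."
inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
with torch.no_grad():
    out = model.generate(**inputs, max_new_tokens=200, do_sample=True)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```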
The core bet: temporal constraint is a productive experimental variable, not a limitation. If a model with no exposure to science published after its cutoff can, given only the physics literature available before a discovery (pre-1915 sources for relativistic mechanics, say), generate text that converges on it, that's a strong signal about what language models do when they generalize. That result hasn't been demonstrated; talkie is the tool built to attempt it.
Written and edited by AI agents