Mr. Chatterbox: A Victorian-Era Language Model Trained Entirely on Public Domain Text

2026-03-31T11:28:50.852Z·2 min read
Trip Venturella has released Mr. Chatterbox, a language model with a unique constraint: it was trained exclusively on out-of-copyright Victorian-era text (1837–1899) from the British Library's collection. No modern data whatsoever.

Model Specifications

Parameters: ~340M (similar to GPT-2-Medium)
Training corpus: 28,035 books
Training tokens: 2.93 billion
Time period: 1837–1899
Source: British Library public domain collection
Model size: 2.05GB on disk
Training framework: Andrej Karpathy's nanochat

Why This Matters

The project raises a fundamental question: Can a useful language model be built from entirely public domain data?

In an era where most LLMs rely on massive web-scraped datasets with uncertain licensing, Mr. Chatterbox represents an alternative path. The model's conversational abilities are limited (responses feel more like a Markov chain than a modern LLM), but it shows the approach is at least viable.

Simon Willison, who covered the project, noted that the Chinchilla scaling laws suggest a 340M model would need roughly 7 billion tokens for optimal training — more than twice what was available. He estimates 4x more training data would be needed for useful conversation.
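The arithmetic behind that estimate is easy to check. A quick sketch, using the common "~20 training tokens per parameter" rule of thumb from the Chinchilla paper (the exact constant is an approximation, not something stated in this article):

```python
# Rough Chinchilla-style estimate: ~20 training tokens per parameter.
PARAMS = 340e6            # Mr. Chatterbox parameter count (~340M)
TOKENS_AVAILABLE = 2.93e9 # Victorian corpus size in tokens

optimal_tokens = 20 * PARAMS                   # ~6.8 billion, "roughly 7 billion"
shortfall = optimal_tokens / TOKENS_AVAILABLE  # ~2.3x, i.e. "more than twice"

print(f"Chinchilla-optimal: ~{optimal_tokens / 1e9:.1f}B tokens")
print(f"Available: {TOKENS_AVAILABLE / 1e9:.2f}B (shortfall ~{shortfall:.1f}x)")
```

This matches the figures quoted above: the optimal budget works out to about 6.8 billion tokens, more than double the 2.93 billion actually available.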

Try It Yourself

The model can be run locally using Simon Willison's LLM framework:

llm install llm-mrchatterbox
llm -m mrchatterbox "Good day, sir"
llm chat -m mrchatterbox

Or without installing LLM:

uvx --with llm-mrchatterbox llm chat -m mrchatterbox

A HuggingFace Spaces demo is also available for quick testing.

The Bigger Picture

Mr. Chatterbox is part of a growing movement exploring models trained entirely on public domain data.

While not ready for production use, the project demonstrates that the AI community is actively seeking alternatives to the "scrape everything" approach to model training.
