Mr. Chatterbox: A Victorian-Era Language Model Trained Entirely on Public Domain Text
Trip Venturella has released Mr. Chatterbox, a language model with a unique constraint: it was trained exclusively on out-of-copyright Victorian-era text (1837–1899) from the British Library's collection. No modern data whatsoever.
Model Specifications
| Property | Value |
|---|---|
| Parameters | ~340M (similar to GPT-2-Medium) |
| Training corpus | 28,035 books |
| Training tokens | 2.93 billion |
| Time period | 1837–1899 |
| Source | British Library public domain collection |
| Model size | 2.05GB on disk |
| Training framework | Andrej Karpathy's nanochat |
Why This Matters
The project raises a fundamental question: Can a useful language model be built from entirely public domain data?
In an era where most LLMs rely on massive web-scraped datasets with uncertain licensing, Mr. Chatterbox represents an alternative path. While the model's conversational abilities are limited (responses feel more like a Markov chain than a modern LLM), it proves the concept is viable.
Simon Willison, who covered the project, noted that the Chinchilla scaling laws suggest a 340M-parameter model would need roughly 7 billion tokens for compute-optimal training, more than twice the 2.93 billion available. He estimates roughly 4x more training data would be needed for useful conversation.
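The arithmetic behind that estimate can be sketched using the common rule of thumb from the Chinchilla paper: compute-optimal training uses about 20 tokens per parameter. (The 20x multiplier is an approximation, not an exact constant.)

```python
# Rough Chinchilla estimate: compute-optimal training uses
# roughly 20 tokens per parameter (Hoffmann et al., 2022).
TOKENS_PER_PARAM = 20

params = 340e6       # Mr. Chatterbox parameter count
available = 2.93e9   # tokens in the Victorian corpus

optimal = params * TOKENS_PER_PARAM  # ~6.8 billion tokens
shortfall = optimal / available      # ~2.3x

print(f"Compute-optimal tokens: {optimal / 1e9:.1f}B")
print(f"Shortfall vs. available corpus: {shortfall:.1f}x")
```

This puts the optimal token count at about 6.8 billion, roughly 2.3 times the size of the available corpus, consistent with the "more than twice" figure above.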
Try It Yourself
The model can be run locally using Simon Willison's LLM framework:

```shell
llm install llm-mrchatterbox
llm -m mrchatterbox "Good day, sir"
llm chat -m mrchatterbox
```

Or without installing LLM, via `uvx`:

```shell
uvx --with llm-mrchatterbox llm chat -m mrchatterbox
```
A HuggingFace Spaces demo is also available for quick testing.
The Bigger Picture
Mr. Chatterbox is part of a growing movement exploring:
- Ethically sourced training data for AI
- Small-scale model training as a research methodology
- What language models learn from different cultural and temporal contexts
- Whether copyright-free corpora can produce commercially useful AI
While not ready for production use, the project demonstrates that the AI community is actively seeking alternatives to the "scrape everything" approach to model training.