Mr. Chatterbox: A Victorian-Era Language Model Trained Entirely on Public Domain Text
Trip Venturella has released Mr. Chatterbox, a language model with a unique constraint: it was trained exclusively on out-of-copyright Victorian-era text (1837–1899) from the British Library's collection. No modern data whatsoever.
Model Specifications
| Property | Value |
|---|---|
| Parameters | ~340M (similar to GPT-2-Medium) |
| Training corpus | 28,035 books |
| Training tokens | 2.93 billion |
| Time period | 1837–1899 |
| Source | British Library public domain collection |
| Model size | 2.05GB on disk |
| Training framework | Andrej Karpathy's nanochat |
Why This Matters
The project raises a fundamental question: Can a useful language model be built from entirely public domain data?
In an era where most LLMs rely on massive web-scraped datasets with uncertain licensing, Mr. Chatterbox represents an alternative path. While the model's conversational abilities are limited (responses feel more like a Markov chain than a modern LLM), it proves the concept is viable.
Simon Willison, who covered the project, noted that the Chinchilla scaling laws suggest a 340M-parameter model would need roughly 7 billion tokens for compute-optimal training, more than twice the 2.93 billion available. He estimates roughly 4x more training data would be needed for useful conversation.
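The arithmetic behind that estimate can be sketched using the common rule of thumb from the Chinchilla paper: compute-optimal training uses about 20 tokens per parameter. (The 20x multiplier is an approximation, not an exact constant.)

```python
# Rough Chinchilla estimate: compute-optimal training uses
# roughly 20 tokens per parameter (Hoffmann et al., 2022).
TOKENS_PER_PARAM = 20

params = 340e6       # Mr. Chatterbox parameter count
available = 2.93e9   # tokens in the Victorian corpus

optimal = params * TOKENS_PER_PARAM  # ~6.8 billion tokens
shortfall = optimal / available      # ~2.3x

print(f"Compute-optimal tokens: {optimal / 1e9:.1f}B")
print(f"Shortfall vs. available corpus: {shortfall:.1f}x")
```

This puts the optimal token count at about 6.8 billion, roughly 2.3 times the size of the available corpus, consistent with the "more than twice" figure above.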
Try It Yourself
The model can be run locally using Simon Willison's LLM framework:

```shell
llm install llm-mrchatterbox
llm -m mrchatterbox "Good day, sir"
llm chat -m mrchatterbox
```

Or without installing LLM, via `uvx`:

```shell
uvx --with llm-mrchatterbox llm chat -m mrchatterbox
```
A HuggingFace Spaces demo is also available for quick testing.
The Bigger Picture
Mr. Chatterbox is part of a growing movement exploring:
- Ethically sourced training data for AI
- Small-scale model training as a research methodology
- What language models learn from different cultural and temporal contexts
- Whether copyright-free corpora can produce commercially useful AI
While not ready for production use, the project demonstrates that the AI community is actively seeking alternatives to the "scrape everything" approach to model training.