Importing the Entire Linux Kernel Git History into PostgreSQL: 1.4 Million Commits as SQL
pgit: The Linux Kernel as a SQL Database — 1.4 Million Commits, 20 Years of Development
A developer has imported the complete Linux kernel git history into PostgreSQL using pgit, a Git-like CLI that stores everything in a SQL database instead of on the filesystem. The project reached Hacker News (HN) with 151 points and 37 comments.
Scale of the Import
| Metric | Value |
|---|---|
| Commits | 1,428,882 |
| File versions | 24,384,844 |
| Unique blobs | 3,089,589 |
| Unique paths | 171,525 |
| Contributors | 38,000 |
| Import time | 2 hours |
| Actual data size | 2.7 GB (vs. 1.95 GB after git gc --aggressive) |
Hardware Used
- CPU: AMD EPYC 7401P (24 cores / 48 threads)
- RAM: 512 GB DDR4 ECC
- Storage: 2x1.92 TB SSD in RAID 0
- Location: Hetzner Finland datacenter (~272 EUR/month)
- Cache: 350 GB xpatch content cache keeping entire repository in memory
Why This Matters
The import makes the entire Linux kernel development history queryable with SQL, enabling analyses that are impossible or extremely difficult with git alone. Reported findings and capabilities include:
- 7 f-bombs found across 1.4 million commit messages (all from just 2 people)
- 665 bug fixes pointing at a single commit
- A filesystem that took 13 years to merge
- Line-by-line blame queries across the entire history
- Cross-file change correlation analysis
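To give a feel for what a query like "bug fixes pointing at a single commit" could look like, here is a minimal sketch in Python. It is not pgit's actual schema or API: the `commits` and `fixes` tables, the sample rows, and the helper `fixes_targets` are all hypothetical, and an in-memory SQLite database stands in for PostgreSQL. The only real convention it relies on is the kernel's `Fixes: <sha> ("subject")` commit trailer.

```python
import re
import sqlite3

# Hypothetical schema: pgit's real table layout is not described in the source.
# SQLite is used here only as a self-contained stand-in for PostgreSQL.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE commits (sha TEXT PRIMARY KEY, author TEXT, message TEXT)")
conn.executemany(
    "INSERT INTO commits VALUES (?, ?, ?)",
    [
        ("a1b2c3d4e5f6", "alice", "mm: rework page cache locking"),
        ("b2c3d4e5f6a1", "bob",
         'mm: fix race\n\nFixes: a1b2c3d4e5f6 ("mm: rework page cache locking")'),
        ("c3d4e5f6a1b2", "carol",
         'mm: fix leak\n\nFixes: a1b2c3d4e5f6 ("mm: rework page cache locking")'),
        ("d4e5f6a1b2c3", "alice",
         'mm: fix typo\n\nFixes: c3d4e5f6a1b2 ("mm: fix leak")'),
    ],
)

# Matches a "Fixes:" trailer at the start of a line and captures the abbreviated SHA.
FIXES_RE = re.compile(r"^Fixes:\s*([0-9a-f]{7,40})\b", re.MULTILINE)


def fixes_targets(message):
    """Return every commit SHA referenced by a Fixes: trailer in the message."""
    return FIXES_RE.findall(message)


# Materialize fix -> target edges so the aggregation is plain SQL.
conn.execute("CREATE TABLE fixes (fix_sha TEXT, target_sha TEXT)")
rows = conn.execute("SELECT sha, message FROM commits").fetchall()
for sha, message in rows:
    for target in fixes_targets(message):
        conn.execute("INSERT INTO fixes VALUES (?, ?)", (sha, target))

# Which commit has the most later commits claiming to fix it?
most_fixed = conn.execute(
    """
    SELECT target_sha, COUNT(*) AS fix_count
    FROM fixes
    GROUP BY target_sha
    ORDER BY fix_count DESC, target_sha
    LIMIT 1
    """
).fetchone()
print(most_fixed)
```

Against a real import, the same `GROUP BY` would presumably run over pgit's own commit table. The point is only that once messages are rows, this kind of aggregation becomes a short query rather than a custom script over `git log` output.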
Technical Approach
pgit uses pg-xpatch for transparent delta compression. Few version control systems besides git have ever managed a full kernel import: Fossil never did, Darcs and Monotone had severe performance issues, and Mercurial is one of the few that can handle it. PostgreSQL with pgit completed the import in 2 hours.
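The source does not describe how pg-xpatch encodes its deltas, so the following is only a generic sketch of the delta-compression idea: store the first version of a file in full and every later version as copy/insert operations against its predecessor. The names `make_delta`, `apply_delta`, and `DeltaStore` are illustrative and not part of pgit or pg-xpatch. Python's standard `difflib` is used to find matching line ranges.

```python
from difflib import SequenceMatcher


def make_delta(base, target):
    """Encode target as copy/insert operations against base (both lists of lines)."""
    ops = []
    matcher = SequenceMatcher(a=base, b=target, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            # Unchanged run: reference the base instead of storing the lines again.
            ops.append(("copy", i1, i2))
        elif tag in ("replace", "insert"):
            # New or changed lines must be stored literally.
            ops.append(("insert", target[j1:j2]))
        # "delete" needs no operation: the deleted base lines are simply not copied.
    return ops


def apply_delta(base, ops):
    """Rebuild the target version from base and the operations from make_delta."""
    out = []
    for op in ops:
        if op[0] == "copy":
            out.extend(base[op[1]:op[2]])
        else:
            out.extend(op[1])
    return out


class DeltaStore:
    """Keeps version 0 in full and each later version as a delta to its predecessor."""

    def __init__(self):
        self._first = None
        self._deltas = []

    def append(self, lines):
        if self._first is None:
            self._first = list(lines)
        else:
            previous = self.get(len(self._deltas))
            self._deltas.append(make_delta(previous, list(lines)))

    def get(self, index):
        """Reconstruct version `index` by replaying the first `index` deltas."""
        lines = list(self._first)
        for ops in self._deltas[:index]:
            lines = apply_delta(lines, ops)
        return lines


store = DeltaStore()
v0 = ["int main(void) {", "    return 0;", "}"]
v1 = ["int main(void) {", '    puts("hi");', "    return 0;", "}"]
v2 = ["int main(void) {", '    puts("hello");', "    return 0;", "}"]
for version in (v0, v1, v2):
    store.append(version)
```

Reading version `n` replays `n` deltas, so reads get slower as the chain grows. Real systems typically bound that cost with periodic full snapshots or caching, which may be one reason the setup described above keeps a large content cache in memory.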
Implications
This demonstrates that PostgreSQL can serve as a viable backend for version control at massive scale. The ability to query 20 years of development history with SQL opens up new possibilities for code archaeology, developer analytics, and large-scale codebase understanding.
Source: oseifert.ch — 151 points on HN