# parakeet.cpp

Fast speech recognition with NVIDIA's Parakeet models in pure C++.

Built on axiom -- a lightweight tensor library with automatic Metal GPU acceleration. No ONNX runtime, no Python runtime, no heavyweight dependencies. Just C++ and one tensor library that outruns PyTorch MPS.
## Supported Models

| Model | Class | Size | Type | Description |
|---|---|---|---|---|
| tdt-ctc-110m | ParakeetTDTCTC | 110M | Offline | English, punctuation & capitalization |
| tdt-600m | ParakeetTDT | 600M | Offline | Multilingual TDT |
| eou-120m | ParakeetEOU | 120M | Streaming | English, RNNT with end-of-utterance detection |
| nemotron-600m | ParakeetNemotron | 600M | Streaming | Multilingual, configurable latency (80ms-1120ms) |
| sortformer | Sortformer | 117M | Streaming | Speaker diarization (up to 4 speakers) |
```cpp
#include <parakeet/parakeet.hpp>

parakeet::Transcriber t("model.safetensors", "vocab.txt");
t.to_gpu();  // optional -- Metal acceleration

auto result = t.transcribe("audio.wav");
std::cout << result.text << std::endl;
```

Word-level timestamps:

```cpp
for (const auto &w : result.word_timestamps) {
    std::cout << w.word << " [" << w.start << "s - " << w.end << "s]\n";
}
```
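Timestamps in seconds are easy to reformat for display. A minimal helper, independent of the library's types, that renders a time as `mm:ss.mmm` (subtitle-style):

```cpp
#include <cstdio>
#include <string>

// Format a time in seconds as mm:ss.mmm.
std::string format_time(double seconds) {
    int ms = static_cast<int>(seconds * 1000.0 + 0.5);  // round to nearest ms
    char buf[16];
    std::snprintf(buf, sizeof(buf), "%02d:%02d.%03d",
                  ms / 60000, (ms / 1000) % 60, ms % 1000);
    return buf;
}
```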
## High-Level API

### Offline Transcription (TDT-CTC 110M)

```cpp
parakeet::Transcriber t("model.safetensors", "vocab.txt");
t.to_gpu();
auto result = t.transcribe("audio.wav");
```
### Offline Transcription (TDT 600M Multilingual)

```cpp
parakeet::TDTTranscriber t("model.safetensors", "vocab.txt",
                           parakeet::make_tdt_600m_config());
auto result = t.transcribe("audio.wav");
```
### Streaming Transcription (EOU 120M)

```cpp
parakeet::StreamingTranscriber t("model.safetensors", "vocab.txt",
                                 parakeet::make_eou_120m_config());

// Feed audio chunks (e.g., from microphone)
while (auto chunk = get_audio_chunk()) {
    auto text = t.transcribe_chunk(chunk);
    if (!text.empty()) std::cout << text << std::flush;
}
std::cout << t.get_text() << std::endl;
```
### Streaming Transcription (Nemotron 600M)

```cpp
// Latency modes: 0=80ms, 1=160ms, 6=560ms, 13=1120ms
auto cfg = parakeet::make_nemotron_600m_config(/*latency_frames=*/1);
parakeet::NemotronTranscriber t("model.safetensors", "vocab.txt", cfg);

while (auto chunk = get_audio_chunk()) {
    auto text = t.transcribe_chunk(chunk);
    if (!text.empty()) std::cout << text << std::flush;
}
```
### Speaker Diarization (Sortformer 117M)

Identify who spoke when -- detects up to 4 speakers with per-frame activity probabilities:

```cpp
parakeet::Sortformer model(parakeet::make_sortformer_117m_config());
model.load_state_dict(axiom::io::safetensors::load("sortformer.safetensors"));

auto wav = parakeet::read_wav("meeting.wav");
auto features = parakeet::preprocess_audio(wav.samples, {.normalize = false});
auto segments = model.diarize(features);

for (const auto &seg : segments) {
    std::cout << "Speaker " << seg.speaker_id
              << ": [" << seg.start << "s - " << seg.end << "s]\n";
}
// Speaker 0: [0.56s - 2.96s]
// Speaker 0: [3.36s - 4.40s]
// Speaker 1: [4.80s - 6.24s]
```
Streaming diarization with arrival-order speaker tracking:

```cpp
parakeet::Sortformer model(parakeet::make_sortformer_117m_config());
model.load_state_dict(axiom::io::safetensors::load("sortformer.safetensors"));

parakeet::AOSCCache aosc_cache(4);  // max 4 speakers

while (auto chunk = get_audio_chunk()) {
    auto features = parakeet::preprocess_audio(chunk, {.normalize = false});
    auto segments = model.diarize_chunk(features, enc_cache, aosc_cache);
    for (const auto &seg : segments) {
        std::cout << "Speaker " << seg.speaker_id
                  << ": [" << seg.start << "s - " << seg.end << "s]\n";
    }
}
```
## Low-Level API

For full control over the pipeline:

CTC (English, punctuation & capitalization):

```cpp
auto cfg = parakeet::make_110m_config();
parakeet::ParakeetTDTCTC model(cfg);
model.load_state_dict(axiom::io::safetensors::load("model.safetensors"));

auto wav = parakeet::read_wav("audio.wav");
auto features = parakeet::preprocess_audio(wav.samples);

parakeet::Tokenizer tokenizer;
tokenizer.load("vocab.txt");
```
TDT (Token-and-Duration Transducer):

Timestamps (CTC or TDT):

```cpp
// Group into word-level timestamps
auto words = parakeet::group_timestamps(ts[0], tokenizer.pieces());
```

GPU acceleration (Metal):

```cpp
model.to(axiom::Device::GPU);
auto features_gpu = features.gpu();
```
## CLI

```text
Usage: parakeet <model.safetensors> <audio.wav> [options]

Model types:
  --model TYPE       Model type (default: tdt-ctc-110m)
                     Types: tdt-ctc-110m, tdt-600m, eou-120m,
                            nemotron-600m, sortformer

Other options:
  --vocab PATH       SentencePiece vocab file
  --gpu              Run on Metal GPU
  --timestamps       Show word-level timestamps
  --streaming        Use streaming mode (eou/nemotron models)
  --latency N        Right context frames for nemotron (0/1/6/13)
  --features PATH    Load pre-computed features from .npy file
```
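The documented `--latency` values map to 80, 160, 560, and 1120 ms for 0, 1, 6, and 13 frames respectively. Those pairs are consistent with each right-context frame adding 80 ms on top of an 80 ms base -- a relationship inferred from the documented values, not stated in the source:

```cpp
// Latency in ms implied by the documented (frames -> ms) pairs:
// each encoder frame spans 80 ms, plus an 80 ms base.
constexpr int latency_ms(int right_context_frames) {
    return 80 * (right_context_frames + 1);
}
```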
Examples:

```sh
./build/parakeet model.safetensors audio.wav --vocab vocab.txt
./build/parakeet model.safetensors audio.wav --vocab vocab.txt --ctc

# GPU acceleration
./build/parakeet model.safetensors audio.wav --vocab vocab.txt --gpu

# Word-level timestamps
./build/parakeet model.safetensors audio.wav --vocab vocab.txt --timestamps

# 600M multilingual TDT model
./build/parakeet model.safetensors audio.wav --vocab vocab.txt --model tdt-600m

# Streaming with EOU
./build/parakeet model.safetensors audio.wav --vocab vocab.txt --model eou-120m

# Nemotron streaming with configurable latency
./build/parakeet model.safetensors audio.wav --vocab vocab.txt --model nemotron-600m --latency 6

# Speaker diarization
./build/parakeet sortformer.safetensors meeting.wav --model sortformer
# Speaker 1: [4.80s - 6.24s]
```
## Setup

### Build

Requires C++20. Axiom is the only dependency (included as a submodule).

```sh
cd parakeet.cpp
make build
```

### Test

```sh
make test
```
### Convert Weights

```sh
# Convert to safetensors
pip install safetensors torch
python scripts/convert_nemo.py parakeet-tdt_ctc-110m.nemo -o model.safetensors
```

The converter supports all model types via the `--model` flag:

```sh
# 110M TDT-CTC (default)
python scripts/convert_nemo.py checkpoint.nemo -o model.safetensors --model 110m-tdt-ctc

# 600M multilingual TDT
python scripts/convert_nemo.py checkpoint.nemo -o model.safetensors --model 600m-tdt

# 120M EOU streaming
python scripts/convert_nemo.py checkpoint.nemo -o model.safetensors --model eou-120m

# 600M Nemotron streaming
python scripts/convert_nemo.py checkpoint.nemo -o model.safetensors --model nemotron-600m

# 117M Sortformer diarization
python scripts/convert_nemo.py checkpoint.nemo -o model.safetensors --model sortformer

python scripts/convert_nemo.py model_weights.ckpt -o model.safetensors
python scripts/convert_nemo.py --dump model.nemo  # inspect checkpoint keys
```

```sh
# Extract the tokenizer from .nemo
tar xf parakeet-tdt_ctc-110m.nemo ./tokenizer.model
# or use the vocab.txt from the HF files page
```
## Architecture

### Offline Models

| Model | Class | Decoding | Notes |
|---|---|---|---|
| CTC | ParakeetCTC | Greedy argmax | Fast, English-only |
| RNNT | ParakeetRNNT | Autoregressive LSTM | Streaming capable |
| TDT | ParakeetTDT | LSTM + duration prediction | Better accuracy than RNNT |

### Streaming Models

| Model | Class | Architecture | Use case |
|---|---|---|---|
| EOU | ParakeetEOU | Streaming RNNT | End-of-utterance detection |
| Nemotron | ParakeetNemotron | Streaming TDT | Configurable latency streaming |

### Diarization

| Model | Class | Type | Use case |
|---|---|---|---|
| Sortformer | Sortformer | Streaming | Speaker diarization (up to 4 speakers) |
## Benchmarks

| Model | Params | CPU (ms) | GPU (ms) | GPU Speedup |
|---|---|---|---|---|
| 110m (TDT-CTC) | 110M | 2,581 | 27 | 96x |
| tdt-600m | 600M | 10,779 | 520 | 21x |
| rnnt-600m | 600M | 10,648 | 1,468 | 7x |
| sortformer | 117M | 3,195 | 479 | 7x |

110m GPU scaling across audio lengths:

| Audio | CPU (ms) | GPU (ms) | RTF | Throughput (× real-time) |
|---|---|---|---|---|
| 1s | 262 | 24 | 0.024 | 41x |
| 5s | 1,222 | 26 | 0.005 | 190x |
| 10s | 2,581 | 27 | 0.003 | 370x |
| 30s | 10,061 | 32 | 0.001 | 935x |
| 60s | 26,559 | 72 | 0.001 | 833x |
### Running benchmarks

```sh
# Full suite
make bench ARGS="--110m=models/model.safetensors --tdt-600m=models/tdt.safetensors"

# Single model
make bench-single ARGS="--110m=models/model.safetensors --benchmark_filter=110m"

# Markdown table output
./build/parakeet_bench --110m=models/model.safetensors --markdown

# Skip GPU benchmarks
./build/parakeet_bench --110m=models/model.safetensors --no-gpu
```

Available model flags: `--110m`, `--tdt-600m`, `--rnnt-600m`, `--sortformer`. All Google Benchmark flags (`--benchmark_filter`, `--benchmark_format=json`, `--benchmark_repetitions=N`) are passed through.
## Notes

- Audio: 16kHz mono WAV (16-bit PCM or 32-bit float)
- Offline models have ~4-5 minute audio length limits; split longer files or use streaming models
- Blank token ID is 1024 (110M) or 8192 (600M)
- GPU acceleration requires Apple Silicon with Metal support
- Timestamps use frame-level alignment: `frame * 0.08s` (8× subsampling × 160-sample hop / 16kHz)
- Sortformer diarization uses unnormalized features (`normalize = false`) -- this differs from the ASR models
## License

MIT