# parakeet.cpp

Fast speech recognition with NVIDIA's [Parakeet](https://huggingface.co/collections/nvidia/parakeet-702d03111484ef) models in pure C++.

Built on [axiom](https://github.com/noahkay13/axiom) -- a lightweight tensor library with automatic Metal GPU acceleration. No ONNX runtime, no Python runtime, no heavyweight dependencies. Just C++ and one tensor library that outruns PyTorch MPS.

## Supported Models

| Model | Class | Size | Type | Description |
|---|---|---|---|---|
| eou-120m | ParakeetEOU | 120M | Streaming | English, RNNT with end-of-utterance detection |
| nemotron-600m | ParakeetNemotron | 600M | Streaming | Multilingual, configurable latency (80ms-1120ms) |
| sortformer | Sortformer | 117M | Streaming | Speaker diarization (up to 4 speakers) |

```cpp
#include <parakeet/parakeet.hpp>

parakeet::Transcriber t("model.safetensors", "vocab.txt");
t.to_gpu(); // optional -- Metal acceleration

auto result = t.transcribe("audio.wav");
std::cout << result.text << std::endl;
```

Word-level timestamps:

```cpp
for (const auto &w : result.word_timestamps) {
    // w: one word with its start/end time
}
```

## High-Level API

### Offline Transcription (TDT-CTC 110M)

```cpp
parakeet::Transcriber t("model.safetensors", "vocab.txt");
t.to_gpu();

auto result = t.transcribe("audio.wav");
```

### Offline Transcription (TDT 600M Multilingual)

```cpp
parakeet::TDTTranscriber t("model.safetensors", "vocab.txt",
                           parakeet::make_tdt_600m_config());

auto result = t.transcribe("audio.wav");
```

### Streaming Transcription (EOU 120M)

```cpp
parakeet::StreamingTranscriber t("model.safetensors", "vocab.txt",
                                 parakeet::make_eou_120m_config());

// Feed audio chunks (e.g., from microphone)
while (auto chunk = get_audio_chunk()) {
    auto text = t.transcribe_chunk(chunk);
    if (!text.empty()) std::cout << text << std::flush;
}

std::cout << t.get_text() << std::endl;
```

### Streaming Transcription (Nemotron 600M)

```cpp
// Latency modes: 0=80ms, 1=160ms, 6=560ms, 13=1120ms
auto cfg = parakeet::make_nemotron_600m_config(/*latency_frames=*/1);
parakeet::NemotronTranscriber t("model.safetensors", "vocab.txt", cfg);

while (auto chunk = get_audio_chunk()) {
    auto text = t.transcribe_chunk(chunk);
    if (!text.empty()) std::cout << text << std::flush;
}
```

### Speaker Diarization (Sortformer 117M)

Identify who spoke when -- detects up to 4 speakers with per-frame activity probabilities:

```cpp
parakeet::Sortformer model(parakeet::make_sortformer_117m_config());
model.load_state_dict(axiom::io::safetensors::load("sortformer.safetensors"));

auto wav = parakeet::read_wav("meeting.wav");
auto features = parakeet::preprocess_audio(wav.samples, {.normalize = false});
auto segments = model.diarize(features);

for (const auto &seg : segments) {
    std::cout << "Speaker " << seg.speaker_id << std::endl;
    // (each segment also carries its start/end time)
}

// Speaker 0: [0.56s - 2.96s]
// Speaker 0: [3.36s - 4.40s]
// Speaker 1: [4.80s - 6.24s]
```

Streaming diarization with arrival-order speaker tracking:

```cpp
parakeet::Sortformer model(parakeet::make_sortformer_117m_config());
model.load_state_dict(axiom::io::safetensors::load("sortformer.safetensors"));

parakeet::AOSCCache aosc_cache(4); // max 4 speakers
// enc_cache: streaming encoder cache (declaration elided in this excerpt)

while (auto chunk = get_audio_chunk()) {
    auto features = parakeet::preprocess_audio(chunk, {.normalize = false});
    auto segments = model.diarize_chunk(features, enc_cache, aosc_cache);
    for (const auto &seg : segments) {
        std::cout << "Speaker " << seg.speaker_id << std::endl;
    }
}
```

## Low-Level API

For full control over the pipeline:

CTC (English, punctuation & capitalization):

```cpp
auto cfg = parakeet::make_110m_config();
parakeet::ParakeetTDTCTC model(cfg);
model.load_state_dict(axiom::io::safetensors::load("model.safetensors"));

auto wav = parakeet::read_wav("audio.wav");
auto features = parakeet::preprocess_audio(wav.samples);

parakeet::Tokenizer tokenizer;
tokenizer.load("vocab.txt");
```

TDT (Token-and-Duration Transducer):

Timestamps (CTC or TDT):

```cpp
// CTC timestamps

// TDT timestamps

// Group into word-level timestamps
auto words = parakeet::group_timestamps(ts[0], tokenizer.pieces());
```

GPU acceleration (Metal):

```cpp
model.to(axiom::Device::GPU);
auto features_gpu = features.gpu();
```

## CLI

```
Usage: parakeet <model.safetensors> <audio.wav> [options]

Model types:
  --model TYPE     Model type (default: tdt-ctc-110m)
                   Types: tdt-ctc-110m, tdt-600m, eou-120m,
                          nemotron-600m, sortformer

Other options:
  --vocab PATH     SentencePiece vocab file
  --gpu            Run on Metal GPU
  --timestamps     Show word-level timestamps
  --streaming      Use streaming mode (eou/nemotron models)
  --latency N      Right context frames for nemotron (0/1/6/13)
  --features PATH  Load pre-computed features from .npy file
```

Examples:

```sh
./build/parakeet model.safetensors audio.wav --vocab vocab.txt
./build/parakeet model.safetensors audio.wav --vocab vocab.txt --ctc

# GPU acceleration
./build/parakeet model.safetensors audio.wav --vocab vocab.txt --gpu

# Word-level timestamps
./build/parakeet model.safetensors audio.wav --vocab vocab.txt --timestamps

# 600M multilingual TDT model
./build/parakeet model.safetensors audio.wav --vocab vocab.txt --model tdt-600m

# Streaming with EOU
./build/parakeet model.safetensors audio.wav --vocab vocab.txt --model eou-120m

# Nemotron streaming with configurable latency
./build/parakeet model.safetensors audio.wav --vocab vocab.txt --model nemotron-600m --latency 6

# Speaker diarization
./build/parakeet sortformer.safetensors meeting.wav --model sortformer
# Speaker 1: [4.80s - 6.24s]
```

## Setup

### Build

Requires C++20. Axiom is the only dependency (included as a submodule).

```sh
cd parakeet.cpp
make build
```

### Test

```sh
make test
```

### Convert Weights

```sh
# Convert to safetensors
pip install safetensors torch
python scripts/convert_nemo.py parakeet-tdt_ctc-110m.nemo -o model.safetensors
```

The converter supports all model types via the `--model` flag:

```sh
# 110M TDT-CTC (default)
python scripts/convert_nemo.py checkpoint.nemo -o model.safetensors --model 110m-tdt-ctc

# 600M multilingual TDT
python scripts/convert_nemo.py checkpoint.nemo -o model.safetensors --model 600m-tdt

# 120M EOU streaming
python scripts/convert_nemo.py checkpoint.nemo -o model.safetensors --model eou-120m

# 600M Nemotron streaming
python scripts/convert_nemo.py checkpoint.nemo -o model.safetensors --model nemotron-600m

# 117M Sortformer diarization
python scripts/convert_nemo.py checkpoint.nemo -o model.safetensors --model sortformer

python scripts/convert_nemo.py model_weights.ckpt -o model.safetensors

python scripts/convert_nemo.py --dump model.nemo  # inspect checkpoint keys
```

Extract the tokenizer from the .nemo archive:

```sh
tar xf parakeet-tdt_ctc-110m.nemo ./tokenizer.model
# or use the vocab.txt from the HF files page
```

## Architecture

### Offline Models

| Model | Class | Architecture | Use case |
|---|---|---|---|
| CTC | ParakeetCTC | Greedy argmax | Fast, English-only |
| RNNT | ParakeetRNNT | Autoregressive LSTM | Streaming capable |
| TDT | ParakeetTDT | LSTM + duration prediction | Better accuracy than RNNT |

### Streaming Models

| Model | Class | Architecture | Use case |
|---|---|---|---|
| EOU | ParakeetEOU | Streaming RNNT | End-of-utterance detection |
| Nemotron | ParakeetNemotron | Streaming TDT | Configurable latency streaming |

### Diarization

| Model | Class | Architecture | Use case |
|---|---|---|---|
| Sortformer | Sortformer | Streaming transformer | Speaker diarization (up to 4 speakers) |

## Benchmarks

| Model | Params | CPU (ms) | GPU (ms) | GPU Speedup |
|---|---|---|---|---|
| 110m (TDT-CTC) | 110M | 2,581 | 27 | 96x |
| tdt-600m | 600M | 10,779 | 520 | 21x |
| rnnt-600m | 600M | 10,648 | 1,468 | 7x |
| sortformer | 117M | 3,195 | 479 | 7x |

110m GPU scaling across audio lengths:

| Audio | CPU (ms) | GPU (ms) | RTF | Throughput |
|---|---|---|---|---|
| 1s | 262 | 24 | 0.024 | 41x |
| 5s | 1,222 | 26 | 0.005 | 190x |
| 10s | 2,581 | 27 | 0.003 | 370x |
| 30s | 10,061 | 32 | 0.001 | 935x |
| 60s | 26,559 | 72 | 0.001 | 833x |

### Running benchmarks

```sh
# Full suite
make bench ARGS="--110m=models/model.safetensors --tdt-600m=models/tdt.safetensors"

# Single model
make bench-single ARGS="--110m=models/model.safetensors --benchmark_filter=110m"

# Markdown table output
./build/parakeet_bench --110m=models/model.safetensors --markdown

# Skip GPU benchmarks
./build/parakeet_bench --110m=models/model.safetensors --no-gpu
```

Available model flags: `--110m`, `--tdt-600m`, `--rnnt-600m`, `--sortformer`. All Google Benchmark flags (`--benchmark_filter`, `--benchmark_format=json`, `--benchmark_repetitions=N`) are passed through.

## Notes

## License

MIT
