# parakeet.cpp

Fast speech recognition with NVIDIA's [Parakeet](https://huggingface.co/collections/nvidia/parakeet-702d03111484ef) models in pure C++.

Built on [axiom](https://github.com/noahkay13/axiom) -- a lightweight tensor library with automatic Metal GPU acceleration. No ONNX runtime, no Python runtime, no heavyweight dependencies. Just C++ and one tensor library that outruns PyTorch MPS.

## Supported Models

| Model | Class | Size | Type | Description |
|---|---|---|---|---|
| eou-120m | ParakeetEOU | 120M | Streaming | English, RNNT with end-of-utterance detection |
| nemotron-600m | ParakeetNemotron | 600M | Streaming | Multilingual, configurable latency (80ms-1120ms) |
| sortformer | Sortformer | 117M | Streaming | Speaker diarization (up to 4 speakers) |

```cpp
#include <parakeet/parakeet.hpp>

parakeet::Transcriber t("model.safetensors", "vocab.txt");
t.to_gpu(); // optional -- Metal acceleration

auto result = t.transcribe("audio.wav");
std::cout << result.text << std::endl;
```

Word-level timestamps:

```cpp
for (const auto &w : result.word_timestamps) {
    // w: one word with its start/end time
}
```

## High-Level API

### Offline Transcription (TDT-CTC 110M)

```cpp
parakeet::Transcriber t("model.safetensors", "vocab.txt");
t.to_gpu();

auto result = t.transcribe("audio.wav");
```

### Offline Transcription (TDT 600M Multilingual)

```cpp
parakeet::TDTTranscriber t("model.safetensors", "vocab.txt",
                           parakeet::make_tdt_600m_config());

auto result = t.transcribe("audio.wav");
```

### Streaming Transcription (EOU 120M)

```cpp
parakeet::StreamingTranscriber t("model.safetensors", "vocab.txt",
                                 parakeet::make_eou_120m_config());

// Feed audio chunks (e.g., from microphone)
while (auto chunk = get_audio_chunk()) {
    auto text = t.transcribe_chunk(chunk);
    if (!text.empty()) std::cout << text << std::flush;
}

std::cout << t.get_text() << std::endl;
```

### Streaming Transcription (Nemotron 600M)

```cpp
// Latency modes: 0=80ms, 1=160ms, 6=560ms, 13=1120ms
auto cfg = parakeet::make_nemotron_600m_config(/*latency_frames=*/1);
parakeet::NemotronTranscriber t("model.safetensors", "vocab.txt", cfg);

while (auto chunk = get_audio_chunk()) {
    auto text = t.transcribe_chunk(chunk);
    if (!text.empty()) std::cout << text << std::flush;
}
```

### Speaker Diarization (Sortformer 117M)

Identify who spoke when -- detects up to 4 speakers with per-frame activity probabilities:

```cpp
parakeet::Sortformer model(parakeet::make_sortformer_117m_config());
model.load_state_dict(axiom::io::safetensors::load("sortformer.safetensors"));

auto wav = parakeet::read_wav("meeting.wav");
auto features = parakeet::preprocess_audio(wav.samples, {.normalize = false});
auto segments = model.diarize(features);

for (const auto &seg : segments) {
    std::cout << "Speaker " << seg.speaker_id << std::endl;
    // (each segment also carries its start/end time)
}

// Speaker 0: [0.56s - 2.96s]
// Speaker 0: [3.36s - 4.40s]
// Speaker 1: [4.80s - 6.24s]
```

Streaming diarization with arrival-order speaker tracking:

```cpp
parakeet::Sortformer model(parakeet::make_sortformer_117m_config());
model.load_state_dict(axiom::io::safetensors::load("sortformer.safetensors"));

parakeet::AOSCCache aosc_cache(4); // max 4 speakers
// enc_cache: streaming encoder cache (declaration elided in this excerpt)

while (auto chunk = get_audio_chunk()) {
    auto features = parakeet::preprocess_audio(chunk, {.normalize = false});
    auto segments = model.diarize_chunk(features, enc_cache, aosc_cache);
    for (const auto &seg : segments) {
        std::cout << "Speaker " << seg.speaker_id << std::endl;
    }
}
```

## Low-Level API

For full control over the pipeline:

CTC (English, punctuation & capitalization):

```cpp
auto cfg = parakeet::make_110m_config();
parakeet::ParakeetTDTCTC model(cfg);
model.load_state_dict(axiom::io::safetensors::load("model.safetensors"));

auto wav = parakeet::read_wav("audio.wav");
auto features = parakeet::preprocess_audio(wav.samples);

parakeet::Tokenizer tokenizer;
tokenizer.load("vocab.txt");
```

TDT (Token-and-Duration Transducer):

Timestamps (CTC or TDT):

```cpp
// CTC timestamps

// TDT timestamps

// Group into word-level timestamps
auto words = parakeet::group_timestamps(ts[0], tokenizer.pieces());
```

GPU acceleration (Metal):

```cpp
model.to(axiom::Device::GPU);
auto features_gpu = features.gpu();
```

## CLI

```
Usage: parakeet <model.safetensors> <audio.wav> [options]

Model types:
  --model TYPE     Model type (default: tdt-ctc-110m)
                   Types: tdt-ctc-110m, tdt-600m, eou-120m,
                          nemotron-600m, sortformer

Other options:
  --vocab PATH     SentencePiece vocab file
  --gpu            Run on Metal GPU
  --timestamps     Show word-level timestamps
  --streaming      Use streaming mode (eou/nemotron models)
  --latency N      Right context frames for nemotron (0/1/6/13)
  --features PATH  Load pre-computed features from .npy file
```

Examples:

```sh
./build/parakeet model.safetensors audio.wav --vocab vocab.txt
./build/parakeet model.safetensors audio.wav --vocab vocab.txt --ctc

# GPU acceleration
./build/parakeet model.safetensors audio.wav --vocab vocab.txt --gpu

# Word-level timestamps
./build/parakeet model.safetensors audio.wav --vocab vocab.txt --timestamps

# 600M multilingual TDT model
./build/parakeet model.safetensors audio.wav --vocab vocab.txt --model tdt-600m

# Streaming with EOU
./build/parakeet model.safetensors audio.wav --vocab vocab.txt --model eou-120m

# Nemotron streaming with configurable latency
./build/parakeet model.safetensors audio.wav --vocab vocab.txt --model nemotron-600m --latency 6

# Speaker diarization
./build/parakeet sortformer.safetensors meeting.wav --model sortformer
# Speaker 1: [4.80s - 6.24s]
```

## Setup

### Build

Requires C++20. Axiom is the only dependency (included as a submodule).

```sh
cd parakeet.cpp
make build
```

### Test

```sh
make test
```

### Convert Weights

```sh
# Convert to safetensors
pip install safetensors torch
python scripts/convert_nemo.py parakeet-tdt_ctc-110m.nemo -o model.safetensors
```

The converter supports all model types via the `--model` flag:

```sh
# 110M TDT-CTC (default)
python scripts/convert_nemo.py checkpoint.nemo -o model.safetensors --model 110m-tdt-ctc

# 600M multilingual TDT
python scripts/convert_nemo.py checkpoint.nemo -o model.safetensors --model 600m-tdt

# 120M EOU streaming
python scripts/convert_nemo.py checkpoint.nemo -o model.safetensors --model eou-120m

# 600M Nemotron streaming
python scripts/convert_nemo.py checkpoint.nemo -o model.safetensors --model nemotron-600m

# 117M Sortformer diarization
python scripts/convert_nemo.py checkpoint.nemo -o model.safetensors --model sortformer

python scripts/convert_nemo.py model_weights.ckpt -o model.safetensors

python scripts/convert_nemo.py --dump model.nemo  # inspect checkpoint keys
```

Extract the tokenizer from the .nemo archive:

```sh
tar xf parakeet-tdt_ctc-110m.nemo ./tokenizer.model
# or use the vocab.txt from the HF files page
```

## Architecture

### Offline Models

| Model | Class | Architecture | Use case |
|---|---|---|---|
| CTC | ParakeetCTC | Greedy argmax | Fast, English-only |
| RNNT | ParakeetRNNT | Autoregressive LSTM | Streaming capable |
| TDT | ParakeetTDT | LSTM + duration prediction | Better accuracy than RNNT |

### Streaming Models

| Model | Class | Architecture | Use case |
|---|---|---|---|
| EOU | ParakeetEOU | Streaming RNNT | End-of-utterance detection |
| Nemotron | ParakeetNemotron | Streaming TDT | Configurable latency streaming |

### Diarization

| Model | Class | Architecture | Use case |
|---|---|---|---|
| Sortformer | Sortformer | Streaming transformer | Speaker diarization (up to 4 speakers) |

## Benchmarks

| Model | Params | CPU (ms) | GPU (ms) | GPU Speedup |
|---|---|---|---|---|
| 110m (TDT-CTC) | 110M | 2,581 | 27 | 96x |
| tdt-600m | 600M | 10,779 | 520 | 21x |
| rnnt-600m | 600M | 10,648 | 1,468 | 7x |
| sortformer | 117M | 3,195 | 479 | 7x |

110m GPU scaling across audio lengths:

| Audio | CPU (ms) | GPU (ms) | RTF | Throughput |
|---|---|---|---|---|
| 1s | 262 | 24 | 0.024 | 41x |
| 5s | 1,222 | 26 | 0.005 | 190x |
| 10s | 2,581 | 27 | 0.003 | 370x |
| 30s | 10,061 | 32 | 0.001 | 935x |
| 60s | 26,559 | 72 | 0.001 | 833x |

### Running benchmarks

```sh
# Full suite
make bench ARGS="--110m=models/model.safetensors --tdt-600m=models/tdt.safetensors"

# Single model
make bench-single ARGS="--110m=models/model.safetensors --benchmark_filter=110m"

# Markdown table output
./build/parakeet_bench --110m=models/model.safetensors --markdown

# Skip GPU benchmarks
./build/parakeet_bench --110m=models/model.safetensors --no-gpu
```

Available model flags: `--110m`, `--tdt-600m`, `--rnnt-600m`, `--sortformer`. All Google Benchmark flags (`--benchmark_filter`, `--benchmark_format=json`, `--benchmark_repetitions=N`) are passed through.

## Notes

## License

MIT
