py-ai-dev/wav2vec2.cpp

Fast CPU inference for wav2vec2 ASR models — no Python, no PyTorch, no CUDA required.

whisper.cpp did this for Whisper. wav2vec2 has thousands of community fine-tuned models on HuggingFace — Telugu, Tamil, Arabic, Hindi, Swahili, Bangla, 100+ languages — and none of them had a C++ inference path until now.

Features

ggml-powered — ships ggml as a submodule: quantized matmul, SIMD kernels (NEON / AVX2), and a clear upgrade path to Metal and CUDA
Q8_0 and Q4_0 quantization — linear weights quantized at convert time; inference dequantizes on-the-fly. wav2vec2-base drops from 360 MB (F32) to 96 MB (Q8_0) or 48 MB (Q4_0)
Universal — converts any Wav2Vec2ForCTC model from HuggingFace to GGUF
Built-in WAV reader — no libsndfile, no miniaudio; just a file path
Greedy + beam search CTC — greedy by default (-b 1), full prefix beam search with -b N
Multi-threaded attention — transformer heads split across CPU cores (-t N)
ARM + x86 — ggml auto-selects NEON / AVX2 / AVX512 at configure time
Tested — 83 unit tests covering math ops, the GGUF reader, and the beam search decoder
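
To make the quantization bullet above concrete, here is a minimal Python sketch of block-wise 8-bit quantization in the style of ggml's Q8_0 format (blocks of 32 weights, one float16 scale per block). The function names are illustrative and are not part of wav2vec2.cpp or its converter; the repository's actual code may differ in rounding and memory layout.

```python
import struct

BLOCK_SIZE = 32  # Q8_0 groups weights into blocks of 32


def to_float16(x):
    """Round a Python float to the nearest float16 value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def quantize_q8_0(weights):
    """Quantize a flat list of floats into (scale, int8 values) blocks."""
    if len(weights) % BLOCK_SIZE != 0:
        raise ValueError("length must be a multiple of 32")
    blocks = []
    for start in range(0, len(weights), BLOCK_SIZE):
        chunk = weights[start:start + BLOCK_SIZE]
        max_abs = max(abs(w) for w in chunk)
        scale = to_float16(max_abs / 127.0)  # the scale is stored as float16
        inv = 1.0 / scale if scale > 0.0 else 0.0
        # Clamp in case float16 rounding of the scale pushes a value past 127.
        quants = [max(-127, min(127, round(w * inv))) for w in chunk]
        blocks.append((scale, quants))
    return blocks


def dequantize_q8_0(blocks):
    """Rebuild approximate float weights: value = scale * int8."""
    restored = []
    for scale, quants in blocks:
        restored.extend(scale * q for q in quants)
    return restored


# Toy weights (hypothetical values, not taken from a real checkpoint).
weights = [0.05 * ((i % 7) - 3) for i in range(64)]
restored = dequantize_q8_0(quantize_q8_0(weights))
max_error = max(abs(a - b) for a, b in zip(weights, restored))
print(f"max round-trip error: {max_error:.6f}")
```

The round-trip error stays within roughly half a quantization step per block, which is why only the large linear matrices are usually worth quantizing.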

Build

git clone --recursive https://github.com/py-ai-dev/wav2vec2.cpp
cd wav2vec2.cpp
cmake -B build -DCMAKE_BUILD_TYPE=Release
cmake --build build -j$(nproc)

Run tests:

cd build && ctest --output-on-failure

Requirements: C++17 compiler, CMake ≥ 3.14. ggml is included as a git submodule and builds automatically, so no other dependencies are needed. The submodule must be checked out, so clone with --recursive:

git clone --recursive https://github.com/py-ai-dev/wav2vec2.cpp

(If you cloned without --recursive, run: git submodule update --init)

Quick Start

1. Convert a model

pip install transformers torch
python scripts/convert_to_gguf.py facebook/wav2vec2-base-960h model.gguf          # F16 (default)
python scripts/convert_to_gguf.py facebook/wav2vec2-base-960h model_q8.gguf --dtype q8_0  # ~96 MB
python scripts/convert_to_gguf.py facebook/wav2vec2-base-960h model_q4.gguf --dtype q4_0  # ~48 MB

--dtype choices: f32, f16 (default), q8_0, q4_0. Quantization applies only to linear weight matrices; norms, biases, and conv weights stay in F32.
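
As a rough illustration of what each --dtype choice costs on disk, the sketch below estimates the storage for a single linear weight matrix. It assumes the ggml block layouts (Q8_0: 32 int8 values plus one float16 scale; Q4_0: 32 4-bit values plus one float16 scale) and uses a 768 × 3072 matrix, the feed-forward shape of wav2vec2-base, purely as an example. `tensor_bytes` is a hypothetical helper, not part of the converter, and whole-file sizes will differ because norms, biases, conv weights, and metadata are stored separately.

```python
# Bytes needed to store 32 weights in each format.
BYTES_PER_32_WEIGHTS = {
    "f32": 32 * 4,     # 4 bytes per weight
    "f16": 32 * 2,     # 2 bytes per weight
    "q8_0": 32 + 2,    # 32 int8 values + one float16 scale
    "q4_0": 16 + 2,    # 32 packed 4-bit values + one float16 scale
}


def tensor_bytes(rows, cols, dtype):
    """Estimate the storage for a rows x cols weight matrix."""
    if cols % 32 != 0:
        raise ValueError("row length must be a multiple of the block size (32)")
    return rows * cols // 32 * BYTES_PER_32_WEIGHTS[dtype]


for dtype in ("f32", "f16", "q8_0", "q4_0"):
    size = tensor_bytes(768, 3072, dtype)
    print(f"{dtype:5s} {size / 2**20:5.2f} MiB")
```

For this one matrix the estimate is 9.00 MiB in F32, about 2.39 MiB in Q8_0, and about 1.27 MiB in Q4_0.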

2. Transcribe

./build/wav2vec2-cli -m model.gguf -f audio.wav

usage: wav2vec2-cli -m MODEL -f AUDIO [-t THREADS] [-b BEAM] [-v]

  -m  model.gguf    GGUF model file
  -f  audio.wav     input WAV (16 kHz mono recommended; stereo/other rates auto-handled)
  -t  4             number of threads (default: 4)
  -b  1             CTC beam width: 1 = greedy, N > 1 = prefix beam search, e.g. 5 (default: 1)
  -v                verbose output with timing
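
To show what the -b option changes, here is a textbook CTC prefix beam search in Python. It is a sketch for illustration only: the decoder in wav2vec2.cpp may prune or score differently, and `prefix_beam_search`, `log_add`, and the toy two-token vocabulary are assumptions made for this example.

```python
import math
from collections import defaultdict

NEG_INF = -math.inf


def log_add(*values):
    """Numerically stable log(sum(exp(v))) that tolerates -inf inputs."""
    top = max(values)
    if top == NEG_INF:
        return NEG_INF
    return top + math.log(sum(math.exp(v - top) for v in values))


def prefix_beam_search(log_probs, beam_width, blank_id=0):
    """Return the most probable token-id sequence under CTC.

    log_probs: list of frames, each a list of per-token log-probabilities.
    Each beam maps a prefix to (log P ending in blank, log P ending in a token).
    """
    beams = {(): (0.0, NEG_INF)}
    for frame in log_probs:
        candidates = defaultdict(lambda: (NEG_INF, NEG_INF))
        for prefix, (p_blank, p_token) in beams.items():
            for token, lp in enumerate(frame):
                if token == blank_id:
                    # A blank keeps the prefix unchanged.
                    b, t = candidates[prefix]
                    candidates[prefix] = (log_add(b, p_blank + lp, p_token + lp), t)
                    continue
                extended = prefix + (token,)
                b, t = candidates[extended]
                if prefix and token == prefix[-1]:
                    # A repeated token extends the prefix only after a blank;
                    # without a blank it collapses into the same prefix.
                    candidates[extended] = (b, log_add(t, p_blank + lp))
                    b, t = candidates[prefix]
                    candidates[prefix] = (b, log_add(t, p_token + lp))
                else:
                    candidates[extended] = (b, log_add(t, p_blank + lp, p_token + lp))
        ranked = sorted(candidates.items(), key=lambda kv: log_add(*kv[1]), reverse=True)
        beams = dict(ranked[:beam_width])
    best_prefix = max(beams.items(), key=lambda kv: log_add(*kv[1]))[0]
    return list(best_prefix)


# Two frames over a toy vocabulary {0: blank, 1: "a"}.
frames = [[math.log(0.6), math.log(0.4)]] * 2
narrow = prefix_beam_search(frames, beam_width=1)
wide = prefix_beam_search(frames, beam_width=4)
print(narrow, wide)
```

In this example the width-1 search returns the empty sequence (the single best path is blank, blank with probability 0.36), while the wider beam returns [1] because the three paths that collapse to "a" sum to 0.64. That difference is the reason to trade some speed for a larger beam.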

Examples

# English (base model, about 360 MB in F32)
python scripts/convert_to_gguf.py facebook/wav2vec2-base-960h      en.gguf
./build/wav2vec2-cli -m en.gguf -f speech.wav

# Telugu
python scripts/convert_to_gguf.py vasista22/wav2vec2-telugu-large   te.gguf
./build/wav2vec2-cli -m te.gguf -f telugu.wav

# Arabic (XLS-R fine-tune)
python scripts/convert_to_gguf.py jonatasgrosman/wav2vec2-large-xlsr-53-arabic  ar.gguf
./build/wav2vec2-cli -m ar.gguf -f arabic.wav

# Any other language — same pattern
python scripts/convert_to_gguf.py <any-wav2vec2-ctc-model> out.gguf
./build/wav2vec2-cli -m out.gguf -f audio.wav
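
Before running the examples above, it can help to confirm that an input file matches the recommended 16 kHz mono format. The sketch below uses Python's standard `wave` module, which reads PCM WAV files only. `describe_wav` and `is_recommended_format` are hypothetical helper names, and the demo writes its own silent test file rather than assuming any audio exists on disk.

```python
import os
import tempfile
import wave


def describe_wav(path):
    """Read basic format information from a PCM WAV file header."""
    with wave.open(path, "rb") as wav:
        rate = wav.getframerate()
        return {
            "sample_rate": rate,
            "channels": wav.getnchannels(),
            "sample_width_bytes": wav.getsampwidth(),
            "duration_s": wav.getnframes() / rate,
        }


def is_recommended_format(info):
    """True when the file is 16 kHz mono, the format the CLI recommends."""
    return info["sample_rate"] == 16000 and info["channels"] == 1


# Demo: write one second of 16-bit silence at 16 kHz mono, then inspect it.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "silence.wav")
    with wave.open(demo_path, "wb") as out:
        out.setnchannels(1)
        out.setsampwidth(2)
        out.setframerate(16000)
        out.writeframes(b"\x00\x00" * 16000)
    info = describe_wav(demo_path)

print(info, is_recommended_format(info))
```

Files in other formats are still accepted by the CLI according to its usage text; this check only tells you whether resampling or downmixing will happen.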

Architecture

raw audio (16 kHz float32)
  ↓
CNN Feature Extractor     7 conv layers, total stride 320 → ~49 frames/sec
  ↓
Feature Projection        linear + layer norm  [T × conv_dim → T × hidden]
  ↓
Positional Conv Embed     grouped Conv1D, adds position information
  ↓
Transformer Encoder       12–48 layers (12 base, 24 large, 48 XLS-R-1B), full bidirectional self-attention
  ↓
CTC Head                  linear → argmax over vocab → collapse repeats → drop blanks
  ↓
transcript
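
Two steps of the pipeline above are easy to check by hand: the frame count produced by the CNN feature extractor and the greedy CTC collapse at the end. The Python sketch below assumes the standard wav2vec2 convolution configuration (kernel sizes 10, 3, 3, 3, 3, 2, 2 and strides 5, 2, 2, 2, 2, 2, 2, no padding) and a toy three-token vocabulary. The function names are illustrative and do not come from the C API.

```python
# Kernel sizes and strides of the 7 conv layers in the standard wav2vec2 config.
CONV_KERNELS = (10, 3, 3, 3, 3, 2, 2)
CONV_STRIDES = (5, 2, 2, 2, 2, 2, 2)  # product = 320 samples per frame


def num_frames(num_samples):
    """Number of encoder frames produced for a given number of audio samples."""
    length = num_samples
    for kernel, stride in zip(CONV_KERNELS, CONV_STRIDES):
        length = (length - kernel) // stride + 1
    return length


def greedy_ctc_decode(logits, vocab, blank_id=0):
    """Argmax per frame, collapse consecutive repeats, then drop blanks."""
    ids = [max(range(len(frame)), key=frame.__getitem__) for frame in logits]
    pieces = []
    previous = None
    for token in ids:
        if token != previous and token != blank_id:
            pieces.append(vocab[token])
        previous = token
    return "".join(pieces)


print(num_frames(16000))  # one second of 16 kHz audio -> 49

# Toy logits: rows are frames, columns follow the toy vocabulary order.
vocab = ["<pad>", "H", "I"]
logits = [
    [0.1, 2.0, 0.3],  # H
    [0.2, 1.5, 0.1],  # H again, collapsed into the previous frame
    [3.0, 0.1, 0.2],  # blank
    [0.1, 0.2, 2.5],  # I
    [0.1, 0.3, 1.9],  # I again, collapsed
]
print(greedy_ctc_decode(logits, vocab))  # -> HI
```

Repeats are collapsed before blanks are removed, so a blank between two identical tokens is what allows a genuinely doubled letter to survive decoding.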

Supported Models

Any Wav2Vec2ForCTC checkpoint with:

Config field	Supported values
`feat_extract_norm`	`"group"` (base) or `"layer"` (large/XLS-R)
`feat_extract_activation`	`"gelu"`
Architecture	standard 7-layer CNN + transformer

This covers wav2vec2-base, wav2vec2-large, XLS-R-300M, XLS-R-1B, and the vast majority of community fine-tunes.

Performance

Tested on a 20-core ARM (Cortex-X925 + A725), 120 GB RAM:

Model	Size	Time for 10 s of audio	RTF
wav2vec2-base-960h	360 MB	~0.8 s	0.08
wav2vec2-large-xlsr	1.18 GB	~3.2 s	0.32

RTF (real-time factor) is processing time divided by audio duration, so RTF < 1.0 means faster than real time. Both models run comfortably on CPU with no GPU.
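
The RTF column can be reproduced from the timing column with one division. The short Python sketch below reuses the figures from the table; `real_time_factor` is a hypothetical helper written for this example.

```python
def real_time_factor(processing_seconds, audio_seconds):
    """RTF = time spent transcribing / duration of the audio."""
    if audio_seconds <= 0:
        raise ValueError("audio duration must be positive")
    return processing_seconds / audio_seconds


# Timings for 10 seconds of audio, taken from the table above.
measurements = (("wav2vec2-base-960h", 0.8), ("wav2vec2-large-xlsr", 3.2))
for name, elapsed in measurements:
    rtf = real_time_factor(elapsed, 10.0)
    print(f"{name}: RTF {rtf:.2f} ({1 / rtf:.1f}x faster than real time)")
```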

Repository Structure

wav2vec2.cpp/
├── include/wav2vec2.h          public C API
├── src/
│   ├── ops.h                   math primitives (gelu, layer_norm, conv1d, …)
│   ├── gguf.h                  minimal GGUF reader
│   └── wav2vec2.cpp            model loading + forward pass
├── examples/main/main.cpp      CLI tool
├── tests/
│   ├── test_ops.cpp            unit tests — all math ops
│   └── test_gguf.cpp           unit tests — GGUF reader
├── scripts/convert_to_gguf.py  HuggingFace → GGUF converter
└── CMakeLists.txt

Roadmap

Contributing

PRs welcome. Please add or update tests for any changed logic. Run ctest before submitting.

License

MIT — see LICENSE.

Produced by liodon-ai.
