quant.cpp is the SQLite of LLM inference.
Not the fastest. Not the most feature-complete. The most embeddable, the most readable, and the only engine that compresses the KV cache 7x without quality loss.
Need speed? → llama.cpp
Need throughput? → vLLM
Need to embed LLM in your app with one file? → quant.cpp
Need 7x longer context on the same hardware? → quant.cpp
The world's simplest way to add an LLM to a C/C++ project.
- quant.h single header (15K LOC, 628KB)
- 6-function API (load, new, generate, ask, free_ctx, free_model)
- WASM build (192KB binary)
- MSVC/MinGW Windows support
- Zero external dependencies
- API documentation (docs/api.md)
- quant.h sync with latest source
- Embedding examples (minimal, chat, KV compare)
- pip install quantcpp (Python bindings)
- iOS SDK + demo app
- Android NDK build guide
- Unity C# plugin
- Unreal C++ integration
- npm package (WASM)
- GitHub Pages live demo with pre-loaded model
The reference implementation for KV cache quantization research.
- 7 quantization types (including Polar, QJL, Turbo, Uniform, TurboKV)
- Delta compression (P-frame encoding)
- QK-norm aware compression
- Plugin architecture (3 functions to add new type)
- 34 unit tests
- "Add Your Own Type" tutorial (docs/custom-quantization.md)
- arXiv tech report
- llama.cpp KV type PR (ggml type registration)
- vLLM KV compression plugin
- Benchmarking suite (perplexity across models × KV types)
- Learned codebook quantization
- Per-head adaptive bit allocation
- ❌ GPU speed competition with llama.cpp (would require a tensor-graph IR)
- ❌ Batch serving (vLLM's domain)
- ❌ Training support
- ❌ 100+ model coverage
- One-file forward pass: tq_transformer.c contains the entire inference loop
- Plugin quantization: new types register via tq_traits.c
- Zero dependencies: libc + pthreads only (+ Metal on macOS)
- CPU-first: NEON/AVX2 optimized, GPU as optional accelerator
- Embeddable: quant.h works anywhere a C compiler does