## TurboQuant.cpp v0.1.0 — First Release

Multi-architecture LLM inference engine in pure C with KV cache compression.

### Highlights

- **3 models supported**: Gemma 3 4B, Qwen3.5-0.8B, Gemma 3 270M
- **3.8x KV cache compression** — at 32K context: 1.2 GB vs llama.cpp's 4.4 GB
- **llama.cpp parity**: 51 tok/s single-thread (vs 50.7 tok/s)
- **Multi-shard safetensors**: loads sharded models (Gemma 4B = 2 shards)
- **Dual tokenizer**: GPT2 byte-level BPE + SentencePiece auto-detect
- **TQM format**: pre-quantized mmap binary, instant loading
- **Zero dependencies**: libc only, ~1MB binary
### Supported Models

| Model | Speed (Q4, 6T) | Quality |
|-------|----------------|---------|
| Gemma 3 4B | 5.2 tok/s | "capital of France" → "Paris" |
| Qwen3.5-0.8B | 82 tok/s | 0.999 cosine vs PyTorch |
| Gemma 3 270M | 176 tok/s | per-layer exact match |

### KV Cache Memory Savings

```
Gemma 3 4B at 32K context:
  llama.cpp (FP16 KV): 4,352 MB
  TurboQuant (Q4 KV):  1,156 MB  ← 3.8x compression
```

### Quick Start

```
git clone https://github.com/quantumaikr/TurboQuant.cpp && cd TurboQuant.cpp
bash scripts/quickstart.sh "What is deep learning?"
```

### What's Inside

- 9,000+ lines of pure C — complete inference engine
- 8 quantization types: Uniform, Mixed, PolarQuant, QJL, TurboQuant
- Architecture dispatch: Qwen3.5 (DeltaNet + Attention) + Gemma 3 (Sliding Window + GQA)
- Q4 weight quantization with NEON 2-row batching + thread pool
- Integer Q4×Q8 attention via ARM vdotq_s32
- 20 test suites, 70+ tests
- Python bindings (ctypes), llama.cpp/vLLM integration stubs

### References

- [TurboQuant](https://arxiv.org/abs/2504.19874) (ICLR 2026) — KV cache compression
- [QJL](https://arxiv.org/abs/2406.03482) (AAAI 2025) — 1-bit quantized JL transform
- [PolarQuant](https://arxiv.org/abs/2502.02617) (AISTATS 2026) — polar coordinate quantization