Commit ee6f8de

unamedkr and claude committed
Update release notes for v0.1.0 with Gemma 4B + KV benchmark
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
1 parent 288bf9f commit ee6f8de

File tree

1 file changed: +34 additions, −19 deletions

scripts/release_notes_v0.1.0.md

## TurboQuant.cpp v0.1.0 — First Release

Multi-architecture LLM inference engine in pure C with KV cache compression.

### Highlights

- **3 models supported**: Gemma 3 4B, Qwen3.5-0.8B, Gemma 3 270M
- **3.8x KV cache compression** — at 32K context: 1.2 GB vs llama.cpp's 4.4 GB
- **llama.cpp parity**: 51 tok/s single-thread (vs 50.7 tok/s)
- **Multi-shard safetensors**: loads sharded models (Gemma 4B = 2 shards)
- **Dual tokenizer**: GPT2 byte-level BPE + SentencePiece auto-detect
- **TQM format**: pre-quantized mmap binary, instant loading
- **Zero dependencies**: libc only, ~1MB binary

### Supported Models

| Model | Speed (Q4, 6T) | Quality |
|-------|----------------|---------|
| Gemma 3 4B | 5.2 tok/s | "capital of France" → "Paris" |
| Qwen3.5-0.8B | 82 tok/s | 0.999 cosine vs PyTorch |
| Gemma 3 270M | 176 tok/s | per-layer exact match |

### KV Cache Memory Savings

```
Gemma 3 4B at 32K context:
  llama.cpp (FP16 KV):  4,352 MB
  TurboQuant (Q4 KV):   1,156 MB   ← 3.8x compression
```

### Quick Start

```
git clone https://github.com/quantumaikr/TurboQuant.cpp && cd TurboQuant.cpp
bash scripts/quickstart.sh "What is deep learning?"
```

### What's Inside

- 9,000+ lines of pure C — complete inference engine
- 8 quantization types: Uniform, Mixed, PolarQuant, QJL, TurboQuant
- Architecture dispatch: Qwen3.5 (DeltaNet + Attention) + Gemma 3 (Sliding Window + GQA)
- Q4 weight quantization with NEON 2-row batching + thread pool
- Integer Q4×Q8 attention via ARM vdotq_s32
- 20 test suites, 70+ tests
- Python bindings (ctypes), llama.cpp/vLLM integration stubs
### References

- [TurboQuant](https://arxiv.org/abs/2504.19874) (ICLR 2026) — KV cache compression
- [QJL](https://arxiv.org/abs/2406.03482) (AAAI 2025) — 1-bit quantized JL transform
- [PolarQuant](https://arxiv.org/abs/2502.02617) (AISTATS 2026) — Polar coordinate quantization
