kv-cache
Here are 135 public repositories matching this topic...
Unified KV Cache Compression Methods for Auto-Regressive Models
Updated Jan 4, 2025 - Python
LLM KV cache compression made easy
Updated Apr 9, 2026 - Python
LLM notes covering model inference, transformer model structure, and LLM framework code analysis.
Updated Apr 2, 2026 - Python
Mixture-of-Recursions: Learning Dynamic Recursive Depths for Adaptive Token-Level Computation (NeurIPS 2025)
Updated Sep 26, 2025 - Python
[NeurIPS'23] H2O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models.
Updated Aug 1, 2024 - Python
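The H2O idea is to keep only the small set of cached tokens that have accumulated the most attention mass (the "heavy hitters") and evict the rest. A minimal NumPy sketch of that eviction rule, not the repository's actual code:

```python
import numpy as np

def evict_heavy_hitters(keys, values, attn_scores, budget):
    """Keep only the `budget` KV entries with the largest accumulated
    attention mass (the heavy hitters); drop everything else.

    keys, values: (seq_len, head_dim) arrays of cached K/V
    attn_scores:  (seq_len,) accumulated attention each position received
    """
    if keys.shape[0] <= budget:
        return keys, values
    # Indices of the top-`budget` positions by accumulated attention.
    keep = np.argsort(attn_scores)[-budget:]
    keep.sort()  # preserve original token order
    return keys[keep], values[keep]

# Toy example: 6 cached tokens, budget of 3.
rng = np.random.default_rng(0)
K = rng.standard_normal((6, 4))
V = rng.standard_normal((6, 4))
scores = np.array([0.9, 0.1, 0.5, 0.05, 0.8, 0.2])
K2, V2 = evict_heavy_hitters(K, V, scores, budget=3)
print(K2.shape)  # (3, 4)
```

In the paper the scores are maintained online during decoding; here they are supplied as a plain array for illustration.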
[ICLR'26] The official code implementation for "Cache-to-Cache: Direct Semantic Communication Between Large Language Models"
Updated Mar 13, 2026 - Python
Run larger LLMs with longer contexts on Apple Silicon by using differentiated precision for KV cache quantization. KVSplit enables 8-bit keys & 4-bit values, reducing memory by 59% with <1% quality loss. Includes benchmarking, visualization, and one-command setup. Optimized for M1/M2/M3 Macs with Metal support.
Updated May 21, 2025 - Python
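The core trick described above is asymmetric precision: keys (which drive attention scores) get more bits than values. A minimal sketch of 8-bit-key / 4-bit-value symmetric quantization, assuming per-tensor scales; KVSplit's Metal kernels are far more involved:

```python
import numpy as np

def quantize(x, bits):
    """Symmetric per-tensor quantization to a signed `bits`-bit grid."""
    qmax = 2 ** (bits - 1) - 1                 # 127 for 8-bit, 7 for 4-bit
    scale = max(np.abs(x).max(), 1e-8) / qmax
    q = np.clip(np.round(x / scale), -qmax, qmax).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

# Keys get 8 bits (more precision), values get 4 bits.
rng = np.random.default_rng(0)
K = rng.standard_normal((16, 64)).astype(np.float32)
V = rng.standard_normal((16, 64)).astype(np.float32)
qK, sK = quantize(K, bits=8)
qV, sV = quantize(V, bits=4)
key_err = np.abs(dequantize(qK, sK) - K).mean()
val_err = np.abs(dequantize(qV, sV) - V).mean()
print(key_err < val_err)  # True: keys are reconstructed more accurately
```

A real implementation would pack two 4-bit values per byte to realize the memory savings; this sketch only shows the precision split.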
[ACL 2026] Towards Efficient Large Language Model Serving: A Survey on System-Aware KV Cache Optimization
Updated Apr 7, 2026 - Python
LLM inference engine from scratch — paged KV cache, continuous batching, chunked prefill, prefix caching, speculative decoding, CUDA graph, tensor parallelism, OpenAI-compatible serving
Updated Apr 9, 2026 - Python
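Of the features listed, the paged KV cache is the foundational one: fixed-size blocks are allocated from a shared pool, and each sequence keeps a block table mapping logical token positions to physical blocks. A hypothetical, heavily simplified sketch of that bookkeeping (not the repository's code):

```python
BLOCK_SIZE = 4  # tokens per block (real engines typically use 16 or 32)

class PagedKVCache:
    def __init__(self, num_blocks):
        self.free_blocks = list(range(num_blocks))
        self.block_tables = {}   # seq_id -> [physical block ids]
        self.seq_lens = {}       # seq_id -> tokens written so far

    def append_token(self, seq_id):
        """Reserve cache space for one new token of `seq_id`."""
        table = self.block_tables.setdefault(seq_id, [])
        n = self.seq_lens.get(seq_id, 0)
        if n % BLOCK_SIZE == 0:          # current block full: grab a new one
            table.append(self.free_blocks.pop())
        self.seq_lens[seq_id] = n + 1
        block = table[n // BLOCK_SIZE]
        return block, n % BLOCK_SIZE     # physical (block, offset) slot

    def free(self, seq_id):
        """Return a finished sequence's blocks to the shared pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.seq_lens.pop(seq_id, None)

cache = PagedKVCache(num_blocks=8)
slots = [cache.append_token("seq0") for _ in range(5)]
# 5 tokens span 2 blocks: offsets 0..3 in the first, offset 0 in the second.
```

Because blocks are allocated on demand and returned on completion, memory fragmentation stays bounded, which is what makes continuous batching practical.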
Block Transformer: Global-to-Local Language Modeling for Fast Inference (NeurIPS 2024)
Updated Apr 13, 2025 - Python
[NeurIPS'25] KVCOMM: Online Cross-context KV-cache Communication for Efficient LLM-based Multi-agent Systems
Updated Nov 3, 2025 - Python
3% Is All You Need: Breaking TurboQuant's Compression Limit via Spectral Structure
Updated Apr 7, 2026 - Python
Completion After Prompt Probability: make your LLM make a choice.
Updated Nov 2, 2024 - Python
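The general idea behind this kind of scoring is to rank candidate completions by their total log-probability given the prompt and pick the most likely one. A toy sketch under that assumption, with a hypothetical stand-in for the model call (names like `fake_score` are invented for illustration):

```python
import math

def completion_logprob(token_logprobs):
    """Sum per-token log-probabilities of a completion given the prompt."""
    return sum(token_logprobs)

def choose(prompt, choices, score_fn):
    """Pick the choice whose completion the model finds most likely.
    `score_fn(prompt, choice)` returns per-token log-probs; here it is
    a stand-in for a real model forward pass (hypothetical)."""
    scored = {c: completion_logprob(score_fn(prompt, c)) for c in choices}
    return max(scored, key=scored.get), scored

# Toy stand-in "model": every token gets probability 0.5, so shorter
# completions score higher.
def fake_score(prompt, choice):
    return [math.log(0.5)] * len(choice.split())

best, scores = choose("Is the sky blue?", ["yes", "no idea at all"], fake_score)
print(best)  # "yes": fewer tokens, higher total log-prob
```

With a real model you would also typically length-normalize the scores so longer choices are not unfairly penalized.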
TurboQuant KV cache compression for MLX with fused Metal kernels. 4.6x compression at 98% FP16 speed.
Updated Apr 2, 2026 - Python
This repository contains an implementation of the LLaMA 2 (Large Language Model Meta AI) model, a Generative Pretrained Transformer (GPT) variant. The implementation focuses on the model architecture and the inference process. The code is restructured and heavily commented to facilitate easy understanding of the key parts of the architecture.
Updated Oct 1, 2023 - Python
Notes about LLaMA 2 model
Updated Aug 30, 2023 - Python
Easy control for Key-Value Constrained Generative LLM Inference (https://arxiv.org/abs/2402.06262)
Updated Feb 13, 2024 - Python
First open-source implementation of Google TurboQuant (ICLR 2026): near-optimal KV cache compression for LLM inference. 5x compression with near-zero quality loss.
Updated Apr 1, 2026 - Python