kv-cache
Here are 135 public repositories matching this topic...
Unified KV Cache Compression Methods for Auto-Regressive Models
Updated Jan 4, 2025 - Python
LLM KV cache compression made easy
Updated Apr 9, 2026 - Python
LLM notes covering model inference, transformer model structure, and LLM framework code analysis.
Updated Apr 2, 2026 - Python
Mixture-of-Recursions: Learning Dynamic Recursive Depths for Adaptive Token-Level Computation (NeurIPS 2025)
Updated Sep 26, 2025 - Python
[NeurIPS'23] H2O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models.
Updated Aug 1, 2024 - Python
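The H2O idea is to keep only the small set of cached tokens that have accumulated the most attention mass (the "heavy hitters") and evict the rest. A minimal NumPy sketch of that eviction rule, not the repository's actual code:

```python
import numpy as np

def evict_heavy_hitters(keys, values, attn_scores, budget):
    """Keep only the `budget` KV entries with the largest accumulated
    attention mass (the heavy hitters); drop everything else.

    keys, values: (seq_len, head_dim) arrays of cached K/V
    attn_scores:  (seq_len,) accumulated attention each position received
    """
    if keys.shape[0] <= budget:
        return keys, values
    # Indices of the top-`budget` positions by accumulated attention.
    keep = np.argsort(attn_scores)[-budget:]
    keep.sort()  # preserve original token order
    return keys[keep], values[keep]

# Toy example: 6 cached tokens, budget of 3.
rng = np.random.default_rng(0)
K = rng.standard_normal((6, 4))
V = rng.standard_normal((6, 4))
scores = np.array([0.9, 0.1, 0.5, 0.05, 0.8, 0.2])
K2, V2 = evict_heavy_hitters(K, V, scores, budget=3)
print(K2.shape)  # (3, 4)
```

In the paper the scores are maintained online during decoding; here they are supplied as a plain array for illustration.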
[ICLR'26] The official code implementation for "Cache-to-Cache: Direct Semantic Communication Between Large Language Models"
Updated Mar 13, 2026 - Python
Run larger LLMs with longer contexts on Apple Silicon by using differentiated precision for KV cache quantization. KVSplit enables 8-bit keys & 4-bit values, reducing memory by 59% with <1% quality loss. Includes benchmarking, visualization, and one-command setup. Optimized for M1/M2/M3 Macs with Metal support.
Updated May 21, 2025 - Python
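The core trick described above is asymmetric precision: keys (which drive attention scores) get more bits than values. A minimal sketch of 8-bit-key / 4-bit-value symmetric quantization, assuming per-tensor scales; KVSplit's Metal kernels are far more involved:

```python
import numpy as np

def quantize(x, bits):
    """Symmetric per-tensor quantization to a signed `bits`-bit grid."""
    qmax = 2 ** (bits - 1) - 1                 # 127 for 8-bit, 7 for 4-bit
    scale = max(np.abs(x).max(), 1e-8) / qmax
    q = np.clip(np.round(x / scale), -qmax, qmax).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

# Keys get 8 bits (more precision), values get 4 bits.
rng = np.random.default_rng(0)
K = rng.standard_normal((16, 64)).astype(np.float32)
V = rng.standard_normal((16, 64)).astype(np.float32)
qK, sK = quantize(K, bits=8)
qV, sV = quantize(V, bits=4)
key_err = np.abs(dequantize(qK, sK) - K).mean()
val_err = np.abs(dequantize(qV, sV) - V).mean()
print(key_err < val_err)  # True: keys are reconstructed more accurately
```

A real implementation would pack two 4-bit values per byte to realize the memory savings; this sketch only shows the precision split.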
[ACL 2026] Towards Efficient Large Language Model Serving: A Survey on System-Aware KV Cache Optimization
Updated Apr 7, 2026 - Python
LLM inference engine from scratch — paged KV cache, continuous batching, chunked prefill, prefix caching, speculative decoding, CUDA graph, tensor parallelism, OpenAI-compatible serving
Updated Apr 9, 2026 - Python
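Of the features listed, the paged KV cache is the foundational one: fixed-size blocks are allocated from a shared pool, and each sequence keeps a block table mapping logical token positions to physical blocks. A hypothetical, heavily simplified sketch of that bookkeeping (not the repository's code):

```python
BLOCK_SIZE = 4  # tokens per block (real engines typically use 16 or 32)

class PagedKVCache:
    def __init__(self, num_blocks):
        self.free_blocks = list(range(num_blocks))
        self.block_tables = {}   # seq_id -> [physical block ids]
        self.seq_lens = {}       # seq_id -> tokens written so far

    def append_token(self, seq_id):
        """Reserve cache space for one new token of `seq_id`."""
        table = self.block_tables.setdefault(seq_id, [])
        n = self.seq_lens.get(seq_id, 0)
        if n % BLOCK_SIZE == 0:          # current block full: grab a new one
            table.append(self.free_blocks.pop())
        self.seq_lens[seq_id] = n + 1
        block = table[n // BLOCK_SIZE]
        return block, n % BLOCK_SIZE     # physical (block, offset) slot

    def free(self, seq_id):
        """Return a finished sequence's blocks to the shared pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.seq_lens.pop(seq_id, None)

cache = PagedKVCache(num_blocks=8)
slots = [cache.append_token("seq0") for _ in range(5)]
# 5 tokens span 2 blocks: offsets 0..3 in the first, offset 0 in the second.
```

Because blocks are allocated on demand and returned on completion, memory fragmentation stays bounded, which is what makes continuous batching practical.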
Block Transformer: Global-to-Local Language Modeling for Fast Inference (NeurIPS 2024)
Updated Apr 13, 2025 - Python
[NeurIPS'25] KVCOMM: Online Cross-context KV-cache Communication for Efficient LLM-based Multi-agent Systems
Updated Nov 3, 2025 - Python
3% Is All You Need: Breaking TurboQuant's Compression Limit via Spectral Structure
Updated Apr 7, 2026 - Python
Completion After Prompt Probability: make your LLM make a choice.
Updated Nov 2, 2024 - Python
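The general idea behind this kind of scoring is to rank candidate completions by their total log-probability given the prompt and pick the most likely one. A toy sketch under that assumption, with a hypothetical stand-in for the model call (names like `fake_score` are invented for illustration):

```python
import math

def completion_logprob(token_logprobs):
    """Sum per-token log-probabilities of a completion given the prompt."""
    return sum(token_logprobs)

def choose(prompt, choices, score_fn):
    """Pick the choice whose completion the model finds most likely.
    `score_fn(prompt, choice)` returns per-token log-probs; here it is
    a stand-in for a real model forward pass (hypothetical)."""
    scored = {c: completion_logprob(score_fn(prompt, c)) for c in choices}
    return max(scored, key=scored.get), scored

# Toy stand-in "model": every token gets probability 0.5, so shorter
# completions score higher.
def fake_score(prompt, choice):
    return [math.log(0.5)] * len(choice.split())

best, scores = choose("Is the sky blue?", ["yes", "no idea at all"], fake_score)
print(best)  # "yes": fewer tokens, higher total log-prob
```

With a real model you would also typically length-normalize the scores so longer choices are not unfairly penalized.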
TurboQuant KV cache compression for MLX with fused Metal kernels. 4.6x compression at 98% FP16 speed.
Updated Apr 2, 2026 - Python
This repository contains an implementation of the LLaMA 2 (Large Language Model Meta AI) model, a Generative Pretrained Transformer (GPT) variant. The implementation focuses on the model architecture and the inference process. The code is restructured and heavily commented to facilitate easy understanding of the key parts of the architecture.
Updated Oct 1, 2023 - Python
Notes about LLaMA 2 model
Updated Aug 30, 2023 - Python
Easy control for Key-Value Constrained Generative LLM Inference (https://arxiv.org/abs/2402.06262)
Updated Feb 13, 2024 - Python
First open-source implementation of Google TurboQuant (ICLR 2026): near-optimal KV cache compression for LLM inference. 5x compression with near-zero quality loss.
Updated Apr 1, 2026 - Python