KV Cache Compression
• SnapKV: LLM Knows What You are Looking for Before Generation (arXiv:2404.14469)
• Finch: Prompt-guided Key-Value Cache Compression (arXiv:2408.00167)
• Beyond RAG: Task-Aware KV Cache Compression for Comprehensive Knowledge Reasoning (arXiv:2503.04973)
• A Simple and Effective L_2 Norm-Based Strategy for KV Cache Compression (arXiv:2406.11430)
• FastKV: Decoupling of Context Reduction and KV Cache Compression for Prefill-Decoding Acceleration (arXiv:2502.01068)
• ChunkKV: Semantic-Preserving KV Cache Compression for Efficient Long-Context LLM Inference (arXiv:2502.00299)
• Efficient Streaming Language Models with Attention Sinks (arXiv:2309.17453)
• Transformers are Multi-State RNNs (arXiv:2401.06104)
• H_2O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models (arXiv:2306.14048)
• Q-Filters: Leveraging QK Geometry for Efficient KV Cache Compression (arXiv:2503.02812)
• ThinK: Thinner Key Cache by Query-Driven Pruning (arXiv:2407.21018)
• LightTransfer: Your Long-Context LLM is Secretly a Hybrid Model with Effortless Adaptation (arXiv:2410.13846)
• DuoAttention: Efficient Long-Context LLM Inference with Retrieval and Streaming Heads (arXiv:2410.10819)
• Scissorhands: Exploiting the Persistence of Importance Hypothesis for LLM KV Cache Compression at Test Time (arXiv:2305.17118)
• PyramidKV: Dynamic KV Cache Compression based on Pyramidal Information Funneling (arXiv:2406.02069)