Implemented Heavy-Hitter Oracle (H2O), a dynamic sparse KV-cache eviction mechanism, in the vLLM inference engine. Achieved a 20–30% speedup at equivalent sparsity levels by retaining only the key-value entries of high-attention tokens during decoding.
vLLM
LLM Inference
Sparse Attention
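
The retention idea described in this entry can be illustrated with a minimal sketch. It assumes a single attention head, a flat list of accumulated attention scores (one per cached token), and two hypothetical parameters, `heavy_budget` and `recent_budget`. The function name and signature are illustrative only; they are not part of vLLM's API and do not represent the actual implementation, which would also have to handle multiple heads and vLLM's paged KV-cache blocks.

```python
def h2o_keep_indices(acc_scores, heavy_budget, recent_budget):
    """Pick which cache positions to retain (illustrative sketch, not vLLM code).

    acc_scores[i] is the attention mass token i has received so far,
    summed over all decoding steps. The policy keeps the most recent
    `recent_budget` tokens plus the `heavy_budget` highest-scoring
    older tokens, and evicts everything else.
    """
    n = len(acc_scores)
    # Nothing to evict while the cache still fits within the total budget.
    if n <= heavy_budget + recent_budget:
        return list(range(n))

    recent_start = n - recent_budget
    # Rank only the tokens older than the recent window by accumulated score.
    older = range(recent_start)
    heavy = sorted(older, key=lambda i: acc_scores[i], reverse=True)[:heavy_budget]
    # Heavy hitters all precede the recent window, so the result stays sorted.
    return sorted(heavy) + list(range(recent_start, n))


# Example: 8 cached tokens, keep 2 heavy hitters and the 3 most recent tokens.
scores = [0.9, 0.1, 0.05, 0.7, 0.2, 0.3, 0.15, 0.25]
kept = h2o_keep_indices(scores, heavy_budget=2, recent_budget=3)
print(kept)  # [0, 3, 5, 6, 7]
```

With a fixed total budget, the cache size stays constant during decoding, which is where a sparse KV cache would be expected to save memory traffic per generated token.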