Our work, Mustafar, unlocks 70% sparsity in KV cache pruning by leveraging unstructured sparsity patterns, supported by a custom attention kernel, thereby boosting the inference efficiency of LLMs.
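As a rough illustration of what unstructured (per-element) KV cache pruning means, the sketch below zeroes out the smallest-magnitude entries of a cache tensor up to a target sparsity. This is a minimal, generic magnitude-pruning example in PyTorch, not Mustafar's actual pruning criterion or its custom attention kernel; the function name `prune_kv_unstructured` and its arguments are our own for illustration.

```python
import torch

def prune_kv_unstructured(kv: torch.Tensor, sparsity: float = 0.7) -> torch.Tensor:
    """Zero the smallest-magnitude entries of a K or V cache tensor.

    kv: cache tensor, e.g. [batch, heads, seq_len, head_dim].
    sparsity: fraction of entries to zero (0.7 keeps the top 30% by magnitude).
    """
    flat = kv.abs().flatten()
    keep = int(flat.numel() * (1.0 - sparsity))   # number of entries to keep
    if keep == 0:
        return torch.zeros_like(kv)
    # Threshold = smallest magnitude among the kept entries.
    threshold = torch.topk(flat, keep, largest=True).values.min()
    mask = kv.abs() >= threshold                  # unstructured: per-element mask
    return kv * mask

# Example: prune a toy value cache to ~70% sparsity.
v_cache = torch.randn(1, 8, 128, 64)
v_pruned = prune_kv_unstructured(v_cache, sparsity=0.7)
print(f"actual sparsity: {(v_pruned == 0).float().mean():.2f}")
```

Unlike structured (e.g. token- or channel-level) pruning, the mask here is free-form over individual elements, which is why a dedicated sparse attention kernel is needed to turn the zeros into actual speed and memory savings.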