3 papers across 2 sessions
Our work, Mustafar, unlocks 70% sparsity in KV cache pruning by exploiting unstructured sparsity patterns, supported by a custom attention kernel, boosting the inference efficiency of LLMs.
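To give a flavor of what unstructured KV cache pruning at a target sparsity looks like, here is a minimal magnitude-based sketch in NumPy. This is an illustrative assumption, not Mustafar's actual algorithm or kernel: it simply zeroes the smallest-magnitude 70% of entries in a toy key tensor, with no structural constraint on which entries are dropped (that per-element freedom is what "unstructured" means).

```python
import numpy as np

def prune_unstructured(kv: np.ndarray, sparsity: float = 0.7) -> np.ndarray:
    """Zero out the smallest-magnitude entries of `kv` so that roughly
    `sparsity` fraction of the tensor becomes zero (unstructured pruning)."""
    k = int(sparsity * kv.size)  # number of entries to prune
    if k == 0:
        return kv.copy()
    # Threshold = k-th smallest absolute value; entries at or below it are pruned.
    thresh = np.partition(np.abs(kv).ravel(), k - 1)[k - 1]
    return np.where(np.abs(kv) <= thresh, 0.0, kv)

rng = np.random.default_rng(0)
keys = rng.standard_normal((4, 16))   # toy KV cache slice: 4 tokens x 16 dims
pruned = prune_unstructured(keys, 0.7)
print((pruned == 0).mean())           # fraction of zeros, roughly 0.7
```

The efficiency win in practice comes from storing and computing attention over the surviving nonzeros with a sparse layout and a custom kernel; this dense sketch only shows the pruning criterion.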