2 papers across 2 sessions
We propose an adaptive layer-reuse technique that dynamically reuses intermediate features across adjacent denoising steps, enabling efficient inference for text-to-video generation models.
Our work, Mustafar, unlocks 70% sparsity in KV-cache pruning by leveraging unstructured sparsity patterns, supported by a custom attention kernel, and boosts the inference efficiency of LLMs.