3 papers across 1 session
Polar Sparsity scales contextual sparsity to large batch sizes by exploiting stable attention-head sparsity and efficient GPU kernels, achieving up to 2.2× speedups with minimal accuracy loss.
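The core idea is to compute only the attention heads predicted to matter for the current input. Below is a minimal, illustrative sketch of that compute pattern, not the paper's Polar Sparsity method or kernels: the shapes, the head-importance score, and the top-k selection are all assumptions made only for demonstration.

```python
# Illustrative sketch of contextual head sparsity (NOT the Polar Sparsity kernels).
# Only the k_active highest-scoring heads are computed; the rest are skipped.
import torch

def sparse_head_attention(x, wq, wk, wv, head_dim, k_active):
    """Toy multi-head attention that evaluates only a subset of heads."""
    b, t, d = x.shape
    n_heads = d // head_dim
    q = (x @ wq).view(b, t, n_heads, head_dim)
    k = (x @ wk).view(b, t, n_heads, head_dim)
    v = (x @ wv).view(b, t, n_heads, head_dim)

    # Assumed stand-in importance score per head; a real system would use a
    # small learned router or calibrated statistics instead.
    scores = (q * k).sum(-1).abs().mean(dim=(0, 1))            # (n_heads,)
    active = torch.topk(scores, k_active).indices              # heads to keep

    out = torch.zeros(b, t, n_heads, head_dim)
    for h in active.tolist():                                  # skip inactive heads
        attn = torch.softmax(
            q[..., h, :] @ k[..., h, :].transpose(-1, -2) / head_dim ** 0.5,
            dim=-1,
        )
        out[..., h, :] = attn @ v[..., h, :]
    return out.reshape(b, t, d)

# Tiny usage example with random weights.
d, head_dim = 64, 16
x = torch.randn(2, 8, d)
wq, wk, wv = (torch.randn(d, d) * d ** -0.5 for _ in range(3))
y = sparse_head_attention(x, wq, wk, wv, head_dim, k_active=2)
print(y.shape)  # torch.Size([2, 8, 64])
```

Because the set of active heads stays relatively stable across tokens in a batch, this kind of per-head skipping can be batched efficiently, which is what makes the approach attractive at large batch sizes.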
DuoGPT is a training-free pruning framework that integrates activation sparsity into the OBC (Optimal Brain Compression) framework, enabling efficient dual-sparse LLM inference with state-of-the-art accuracy–efficiency trade-offs and scalability.
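"Dual-sparse" here means both activations and weights are sparse at inference time. The sketch below only illustrates that compute pattern with naive threshold-based activation sparsity and magnitude weight pruning; it is not DuoGPT's OBC-based calibration, and the threshold and keep ratio are assumed values for demonstration.

```python
# Conceptual sketch of dual sparsity (sparse activations x sparse weights),
# NOT DuoGPT's training-free OBC-based pruning procedure.
import torch

def dual_sparse_linear(x, w, act_threshold=0.5, weight_keep_ratio=0.5):
    """Zero small activations and the smallest-magnitude weights, then multiply."""
    act_mask = x.abs() > act_threshold                 # activation sparsity mask
    x_sparse = x * act_mask

    k = int(w.numel() * weight_keep_ratio)             # number of weights to keep
    thresh = w.abs().flatten().topk(k).values.min()
    w_sparse = w * (w.abs() >= thresh)                 # weight sparsity mask

    return x_sparse @ w_sparse.T                       # dual-sparse matmul

# Usage example with random data.
x = torch.randn(4, 128)
w = torch.randn(256, 128)
y = dual_sparse_linear(x, w)
print(y.shape)  # torch.Size([4, 256])
```

The interesting part of a method like DuoGPT is choosing the weight mask so that accuracy survives when the activations are also sparse; the placeholder magnitude pruning above is only there to show where that decision plugs in.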