3 papers across 2 sessions
Colocating latency-sensitive online and throughput-oriented offline LLM requests within a single inference engine.
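
A rough sketch of the colocation idea, not the paper's actual scheduler (the class names, queue structure, and priority policy here are all assumptions): a serving loop drains the latency-sensitive online queue first each iteration and backfills the remaining batch slots from the offline queue.

```python
from collections import deque
from dataclasses import dataclass, field

@dataclass
class Request:
    req_id: int
    prompt: str
    online: bool  # True = latency-sensitive; False = best-effort batch job

@dataclass
class ColocationScheduler:
    """Toy scheduler: online requests get batch slots first; offline
    requests backfill whatever capacity is left each iteration."""
    max_batch: int = 8
    online_q: deque = field(default_factory=deque)
    offline_q: deque = field(default_factory=deque)

    def submit(self, req: Request) -> None:
        (self.online_q if req.online else self.offline_q).append(req)

    def next_batch(self) -> list:
        batch = []
        # Serve interactive traffic first, up to the batch limit ...
        while self.online_q and len(batch) < self.max_batch:
            batch.append(self.online_q.popleft())
        # ... then backfill idle slots with offline work.
        while self.offline_q and len(batch) < self.max_batch:
            batch.append(self.offline_q.popleft())
        return batch

sched = ColocationScheduler(max_batch=4)
for i in range(2):
    sched.submit(Request(i, f"chat-{i}", online=True))
for i in range(2, 8):
    sched.submit(Request(i, f"batch-{i}", online=False))
print([r.req_id for r in sched.next_batch()])  # [0, 1, 2, 3]
```

The appeal of colocation is that offline work soaks up GPU capacity the online traffic leaves idle, without delaying interactive requests.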
Polar Sparsity scales contextual sparsity to large batches by exploiting stable attention head sparsity and using efficient GPU kernels, achieving up to 2.2× speedups with minimal accuracy loss.
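
A minimal sketch of head-level contextual sparsity, assuming a hypothetical router supplies per-head importance scores (the function, shapes, and selection policy are illustrative, not Polar Sparsity's kernels): because the set of active heads is assumed stable across the batch, one shared top-k selection serves every request, and skipped heads are never computed.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def sparse_head_attention(x, wq, wk, wv, head_scores, k_active):
    """Attention computed only for the top-k_active heads; skipped heads
    contribute zeros. x: (batch, seq, d_model); wq/wk/wv: (n_heads,
    d_model, d_head); head_scores: (n_heads,) from a hypothetical router."""
    n_heads, d_model, d_head = wq.shape
    batch, seq, _ = x.shape
    # One shared active set for the whole batch: head sparsity is assumed
    # stable across requests, so the active set does not grow with batch size.
    active = np.argsort(head_scores)[-k_active:]
    out = np.zeros((batch, seq, n_heads * d_head))
    for h in active:  # a real GPU kernel would fuse this gather
        q, k, v = x @ wq[h], x @ wk[h], x @ wv[h]
        attn = softmax(q @ k.transpose(0, 2, 1) / np.sqrt(d_head))
        out[:, :, h * d_head:(h + 1) * d_head] = attn @ v
    return out

rng = np.random.default_rng(0)
x = rng.standard_normal((2, 5, 32))                    # batch=2, seq=5
wq, wk, wv = (0.1 * rng.standard_normal((8, 32, 4)) for _ in range(3))
print(sparse_head_attention(x, wq, wk, wv, rng.standard_normal(8), 3).shape)
```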
Enhancing cost efficiency in LLM serving through an edge-assisted speculative decoding framework.
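
For context, a toy sketch of the draft-and-verify loop behind speculative decoding, in its greedy variant (the paper's framework and edge/cloud split are not reproduced here; `draft_next` and `target_next` are hypothetical callables mapping a token list to the next token): an edge-side draft model proposes `gamma` tokens, and the server-side target model keeps the longest agreeing prefix plus one corrected token of its own.

```python
import random

def speculative_step(draft_next, target_next, context, gamma=4):
    """One greedy speculative-decoding round: the draft model proposes
    gamma tokens; the target model accepts the longest agreeing prefix
    and appends one corrected token of its own."""
    ctx = list(context)
    proposal = []
    for _ in range(gamma):          # cheap edge-side drafting
        tok = draft_next(ctx)
        proposal.append(tok)
        ctx.append(tok)
    ctx, accepted = list(context), []
    for tok in proposal:            # server-side verification
        if target_next(ctx) != tok:
            break
        accepted.append(tok)
        ctx.append(tok)
    accepted.append(target_next(ctx))  # target's own next token
    return accepted

# Toy stand-ins: the draft agrees with the target 80% of the time.
def target_next(ctx):
    return (sum(ctx) + len(ctx)) % 10

def draft_next(ctx):
    return target_next(ctx) if random.random() < 0.8 else random.randrange(10)

random.seed(0)
print(speculative_step(draft_next, target_next, context=[1, 2, 3]))
```

In a real deployment the target model verifies all `gamma` draft tokens in a single batched forward pass, which is where the latency and cost savings come from; the per-token verification loop above is only for clarity.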