2 papers across 1 session
We develop Q-Palette, a quantizer suite with efficient CUDA inference kernels and wide fractional-bit support, enabling mixed-scheme quantization that achieves ~36% faster LLM decoding than NormalFloat while improving accuracy.
We propose a novel query-agnostic KV cache eviction method for multi-query scenarios.