Associate Professor, Seoul National University
3 papers at NeurIPS 2025
We develop Q-Palette, a quantizer suite with efficient CUDA inference kernels and wide fractional-bit support, enabling mixed-scheme quantization that achieves ~36% faster LLM decoding than NormalFloat while improving accuracy.
We propose a novel query-agnostic KV cache eviction method for multi-query scenarios.
We propose a novel self-improvement algorithm that teaches language models to search effectively.