Postdoc, Institute of Science and Technology Austria
3 papers at NeurIPS 2025
We propose a parallel generation method for LLMs in which multiple instances synchronize through a shared, dynamically updated attention cache.
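A minimal toy sketch of the idea, not the paper's implementation: the cache layout, synchronization scheme, and all names below are assumptions made for illustration only.

```python
# Hypothetical sketch: several decoding "instances" append to one shared,
# growing cache, and each step every instance can attend over tokens
# produced by all instances so far.
from dataclasses import dataclass, field

@dataclass
class SharedCache:
    # (instance_id, token) pairs visible to every instance
    entries: list = field(default_factory=list)

    def append(self, instance_id: int, token: str) -> None:
        self.entries.append((instance_id, token))

    def visible_context(self) -> list:
        # A real system would store key/value tensors; here we only track
        # the token history the instances synchronize on.
        return list(self.entries)

def decode_step(instance_id: int, cache: SharedCache, step: int) -> str:
    # Stand-in for a model forward pass that attends over the shared cache.
    context = cache.visible_context()
    token = f"tok{instance_id}.{step}(ctx={len(context)})"
    cache.append(instance_id, token)
    return token

cache = SharedCache()
for step in range(3):               # lock-step rounds for clarity
    for instance_id in range(2):    # two parallel instances
        print(decode_step(instance_id, cache, step))
```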
Automatically detecting task-specific important tokens to accelerate speculative decoding
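An illustrative sketch only, under assumed heuristics (corpus-frequency importance, importance-modulated draft length); the paper's actual detection and acceleration mechanism may differ.

```python
# Toy sketch: derive task-specific token importance from a small corpus and
# let it modulate how many draft tokens a speculative decoder proposes
# before verification by the target model.
from collections import Counter

def token_importance(task_corpus):
    # Assumed heuristic: rarer tokens in the task corpus count as more important.
    counts = Counter(tok for line in task_corpus for tok in line.split())
    total = sum(counts.values())
    return {tok: 1.0 - counts[tok] / total for tok in counts}

def draft_length(next_token, importance, base=4, cautious=1, threshold=0.75):
    # Propose fewer speculative tokens around important positions so the
    # target model verifies them more often.
    return cautious if importance.get(next_token, 0.0) > threshold else base

corpus = ["translate the sentence", "translate the word", "the sentence is long"]
imp = token_importance(corpus)
for tok in ["the", "translate", "sentence"]:
    print(tok, round(imp[tok], 2), draft_length(tok, imp))
```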
We investigate new scaling laws that predict the performance of LLMs trained over quantized or sparse representations.
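A rough sketch of what such a law could look like, assuming a generic Chinchilla-style form with an "effective parameter count" that shrinks under quantization or sparsity; the functional form, the capacity multiplier, and all constants below are illustrative assumptions, not the paper's fitted law.

```python
# Illustrative only: loss as a function of parameters, tokens, precision,
# and weight density, via an assumed effective-capacity multiplier.

def effective_params(n_params, bits=16, density=1.0, c=0.3):
    # Assumption: lower precision and higher sparsity reduce how many
    # parameters effectively "count".
    return n_params * density * (1.0 - c * max(0, 16 - bits) / 16)

def predicted_loss(n_params, n_tokens, bits=16, density=1.0,
                   E=1.7, A=400.0, B=410.0, alpha=0.34, beta=0.28):
    # Constants are placeholders loosely in the range of published
    # dense-model fits, used only to make the example run.
    n_eff = effective_params(n_params, bits, density)
    return E + A / n_eff**alpha + B / n_tokens**beta

print(predicted_loss(7e9, 1e12))                # full precision, dense
print(predicted_loss(7e9, 1e12, bits=4))        # 4-bit quantized
print(predicted_loss(7e9, 1e12, density=0.5))   # 50% sparse
```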