Researcher, Yandex
2 papers at NeurIPS 2025
We propose a parallel generation method for LLMs in which multiple instances synchronize through a shared, dynamically updated attention cache.
Automatically detecting task-specific important tokens to accelerate speculative decoding.