PhD student, University of Cambridge
1 paper at NeurIPS 2025
A novel method for compressing the attention Key-Value cache along the temporal dimension, greatly reducing inference-time GPU memory usage and improving decoding speed.