A novel method for compressing the attention Key-Value (KV) cache along the temporal dimension, substantially reducing inference-time GPU memory usage and improving decoding speed.