2 papers across 2 sessions
Use lightweight low-rank projections of (q, K) to help index the offloaded KV cache
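A minimal sketch of the idea, with assumed names and a random projection standing in for whatever low-rank map is actually used: keep a small low-rank key index on device, score it against the projected query to approximate q·Kᵀ, and fetch only the top-k entries from the offloaded KV cache.

```python
import numpy as np

# Hypothetical sketch: score offloaded KV-cache entries cheaply via
# low-rank projections of q and K, then fetch only the top-k entries.
# The rank r, projection P, and sizes below are illustrative assumptions.

rng = np.random.default_rng(0)
d, r, n_cached, k = 64, 8, 1000, 32   # head dim, low rank, cache size, fetch budget

# Low-rank projection (in practice this could be learned or SVD-based).
P = rng.standard_normal((d, r)) / np.sqrt(r)

K_cache = rng.standard_normal((n_cached, d))   # full keys, conceptually offloaded
K_index = K_cache @ P                          # small on-device index (n_cached x r)

q = rng.standard_normal(d)
scores = K_index @ (P.T @ q)                   # approximate q @ K_cache.T via low rank
top_idx = np.argsort(scores)[-k:]              # cache entries worth fetching

fetched = K_cache[top_idx]                     # simulate loading from offload
```

The on-device index costs n_cached × r floats instead of n_cached × d, and each lookup does O(n_cached · r) work before touching the offloaded store.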
We propose an efficient, training-free prompt compression method that retains key information in long inputs using the evaluator heads we identified in transformer-based LLMs.