PhD student, University of Edinburgh
1 paper at NeurIPS 2025
Inference-time hyper-scaling compresses the Transformer key–value (KV) cache with Dynamic Memory Sparsification (DMS), boosting LLM reasoning accuracy at equivalent compute and memory budgets.