We propose a lossless, training-free speculative decoding method that accelerates inference for LLMs requiring offloading on a single memory-limited consumer GPU.