3 papers across 2 sessions
The paper introduces a pruning and distillation method for hybrid LLMs that compresses Nemotron-H 8B to 4B parameters with better accuracy and ~2× faster inference, improving the efficiency-accuracy trade-off.
We present a unified theory of RNN expressivity, with novel results on several popular architectures and insights into the relationship between linear and non-linear RNNs.