2 papers across 2 sessions
Using a heterogeneous Mixture-of-Experts model architecture, we show that brain-like processing pathways form due to inductive biases on processing complexity and expert dropout.
We introduce a new method for principled, effective distillation across tokenizers, enabling a number of new applications.