1 paper across 1 session
Our proposed orthogonality and variance losses improve performance in downstream fine-tuning of Mixture-of-Experts models by enhancing expert specificity, addressing expert homogenization caused by load balancing, while maintaining load balance.