1 paper across 1 session
Unifies supervised and reinforcement fine-tuning and outperforms both, with theoretical justification.