PhD student, EPFL - EPF Lausanne
1 paper at NeurIPS 2025
QRPO is a state-of-the-art alignment algorithm that fits the KL-regularized RL objective without relying on preference data.