Principal Researcher, Microsoft
2 papers at NeurIPS 2025
QRPO is a state-of-the-art alignment algorithm that fits the KL-regularized RL objective without relying on preference data.