QRPO is a state-of-the-art alignment algorithm that fits the KL-regularized RL objective without relying on preference data.
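
For reference, the KL-regularized RL objective mentioned here is commonly written as below. The notation is the standard one and is an assumption, since the summary itself fixes no symbols: policy $\pi$, reference policy $\pi_{\text{ref}}$, reward $r$, regularization strength $\beta > 0$, and prompt distribution $\mathcal{D}$.

```latex
% KL-regularized RL objective (standard form; notation assumed, not from the summary)
\max_{\pi} \; \mathbb{E}_{x \sim \mathcal{D}} \Big[ \mathbb{E}_{y \sim \pi(\cdot \mid x)} \big[ r(x, y) \big]
  - \beta \, \mathrm{KL}\big( \pi(\cdot \mid x) \,\|\, \pi_{\text{ref}}(\cdot \mid x) \big) \Big]

% Its well-known closed-form optimum, with partition function Z(x)
\pi^{*}(y \mid x) = \frac{1}{Z(x)} \, \pi_{\text{ref}}(y \mid x) \, \exp\!\Big( \frac{r(x, y)}{\beta} \Big),
\qquad
Z(x) = \sum_{y} \pi_{\text{ref}}(y \mid x) \, \exp\!\Big( \frac{r(x, y)}{\beta} \Big)
```

Preference-based methods such as DPO avoid computing $Z(x)$ because it cancels when two responses to the same prompt are compared; a method that does not use preferences has to handle $Z(x)$ in some other way.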