1 paper across 1 session
Using a new suite of MuJoCo tasks for systematic evaluation, we develop specialized mirror descent-based preference optimization algorithms that outperform existing methods in both MuJoCo and LLM alignment tasks.