Researcher, Google DeepMind
2 papers at NeurIPS 2025
Using a new suite of MuJoCo tasks for systematic evaluation, we develop specialized preference optimization algorithms based on mirror descent that outperform existing methods on both MuJoCo and LLM alignment tasks.
We present the Rao-Blackwellised reparameterisation gradient estimator and show its performance gains on a suite of probabilistic models.