3 papers across 3 sessions
We contribute provable guarantees that regularized policy gradient methods converge to approximate Nash equilibria in two-player zero-sum imperfect-information extensive-form games.
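A toy illustration of the phenomenon (not the paper's construction or proof): in a normal-form zero-sum game such as rock-paper-scissors, simultaneous softmax policy-gradient ascent with an entropy regularizer of weight `tau` damps the cyclic dynamics and drives both players toward an approximate Nash equilibrium, here near-uniform play. All constants (`tau`, `lr`, starting logits) are illustrative choices.

```python
import numpy as np

# Rock-paper-scissors payoff matrix for the row player (zero-sum game).
A = np.array([[0., -1., 1.],
              [1., 0., -1.],
              [-1., 1., 0.]])

def softmax(v):
    z = np.exp(v - v.max())
    return z / z.sum()

tau, lr = 0.1, 0.2            # entropy weight and step size (illustrative)
xl = np.array([1.0, -0.5, 0.0])   # asymmetric starting logits, row player
yl = np.array([-1.0, 0.5, 0.2])   # column player

for _ in range(2000):
    x, y = softmax(xl), softmax(yl)
    # Entropy-regularized ascent directions for each player's payoff.
    gx = A @ y - tau * (np.log(x) + 1.0)
    gy = -(A.T @ x) - tau * (np.log(y) + 1.0)
    # Policy-gradient step through the softmax Jacobian.
    xl += lr * x * (gx - x @ gx)
    yl += lr * y * (gy - y @ gy)

x, y = softmax(xl), softmax(yl)
# Exploitability (best-response gap); small at an approximate Nash.
exploitability = (A @ y).max() - (x @ A).min()
```

Without the entropy term, plain gradient play in this game cycles forever; the regularizer is what turns the dynamics into a contraction toward the (regularized) equilibrium, which is the intuition the paper makes rigorous in the far harder extensive-form setting.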
A simple, general-purpose off-policy REINFORCE method that outperforms PPO, DPO, and STaR on recent benchmarks.
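A minimal sketch of the off-policy REINFORCE idea in general (not this paper's specific method): a two-armed bandit where actions are sampled from a fixed behavior policy, and the target softmax policy is updated with the score-function gradient weighted by a per-sample importance ratio. The bandit setup, reward means, and step size are all assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
theta = np.zeros(2)                 # logits of the target policy
behavior = np.array([0.5, 0.5])     # fixed behavior policy (off-policy data)
true_means = np.array([0.2, 0.8])   # arm 1 is better

def softmax(v):
    z = np.exp(v - v.max())
    return z / z.sum()

for _ in range(2000):
    pi = softmax(theta)
    a = rng.choice(2, p=behavior)          # act with the behavior policy
    r = rng.normal(true_means[a], 0.1)     # sample a reward
    w = pi[a] / behavior[a]                # importance weight
    grad_logp = -pi
    grad_logp[a] += 1.0                    # grad of log pi(a) for softmax
    theta += 0.05 * w * r * grad_logp      # off-policy REINFORCE step
```

The importance weight `w` corrects for the mismatch between the data-collecting policy and the policy being optimized, which is what lets a REINFORCE-style estimator use off-policy data at all.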
We learn a generative model of the Pareto set that can be conditioned on subjective preferences without retraining, for online multi-objective optimization over discrete and mixed spaces.
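A hedged, minimal sketch of the conditioning idea only (not the paper's generative model): over a discrete candidate set with two objectives, the Pareto set is computed once, after which arbitrary subjective preference weights can be answered without recomputing anything. The random candidate set and the linear-scalarization selector are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
F = rng.uniform(0.0, 1.0, size=(200, 2))   # objective values (maximize both)

def pareto_mask(F):
    """Boolean mask of non-dominated rows of F."""
    keep = np.ones(len(F), dtype=bool)
    for i in range(len(F)):
        # i is dominated if some row is >= on all objectives and > on one.
        dom = (F >= F[i]).all(axis=1) & (F > F[i]).any(axis=1)
        if dom.any():
            keep[i] = False
    return keep

P = F[pareto_mask(F)]                      # Pareto front, computed once

def select(weights):
    """Return the Pareto point preferred under a given weight vector."""
    w = np.asarray(weights, dtype=float)
    return P[np.argmax(P @ w)]

a = select([0.9, 0.1])   # a user who mostly cares about objective 0
b = select([0.1, 0.9])   # a user who mostly cares about objective 1
```

Here the "model" of the Pareto set is just an explicit list, which only works for small discrete problems; the paper's contribution is replacing this enumeration with a preference-conditioned generative model so the same trade-off query scales to large discrete and mixed spaces.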