Postdoc, University of Oxford
1 paper at NeurIPS 2025
We propose a novel regret analysis of a simple policy gradient algorithm for bandits, characterizing the regret regimes that arise depending on its learning rate.
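For readers unfamiliar with the setting, the following is a minimal sketch of one common form of policy gradient for bandits: a softmax policy over arms updated with a REINFORCE-style step. It is an illustration under assumptions, not necessarily the algorithm analyzed in the paper. The Bernoulli rewards, the arm means, and the names `softmax` and `run_policy_gradient` are all hypothetical choices made for this example.

```python
import math
import random


def softmax(prefs):
    """Numerically stable softmax over a list of arm preferences."""
    m = max(prefs)
    exps = [math.exp(p - m) for p in prefs]
    total = sum(exps)
    return [e / total for e in exps]


def run_policy_gradient(means, learning_rate, horizon, seed=0):
    """Run softmax policy gradient on a Bernoulli bandit (assumed reward model).

    Returns the cumulative pseudo-regret: the sum over rounds of the gap
    between the best arm's mean and the mean of the arm actually pulled.
    """
    rng = random.Random(seed)
    prefs = [0.0] * len(means)  # one preference (logit) per arm
    best = max(means)
    regret = 0.0
    for _ in range(horizon):
        probs = softmax(prefs)
        # Sample an arm from the current softmax policy.
        arm = rng.choices(range(len(means)), weights=probs)[0]
        reward = 1.0 if rng.random() < means[arm] else 0.0
        # REINFORCE step: the gradient of log pi(arm) with respect to the
        # preferences is one_hot(arm) - probs, scaled here by the reward.
        for a in range(len(prefs)):
            indicator = 1.0 if a == arm else 0.0
            prefs[a] += learning_rate * reward * (indicator - probs[a])
        regret += best - means[arm]
    return regret


if __name__ == "__main__":
    # Compare cumulative pseudo-regret across a few learning rates.
    for lr in (0.01, 0.1, 1.0):
        print(lr, run_policy_gradient([0.9, 0.5], lr, horizon=5000))
```

Running the script prints the cumulative pseudo-regret for each learning rate on a two-armed instance, which gives a rough empirical feel for how the learning rate affects performance. Any single run depends on the seed, so it should not be read as evidence for a particular regret rate.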