Lecturer, University of Oxford
2 papers at NeurIPS 2025
We show that RLHF is vulnerable to strategic manipulation, discuss the trade-offs between incentive alignment and policy alignment, and propose an approximately strategyproof algorithm to address this vulnerability.