1 paper across 1 session
We show RLHF is vulnerable to strategic manipulation, discuss trade-offs between incentive and policy alignment, and propose an approximately strategyproof algorithm to address it.