strategyproofness

1 paper across 1 session

Poster Session 1

Wednesday, December 3, 2025 · 11:00 AM → 2:00 PM

Strategyproof Reinforcement Learning from Human Feedback

#408 · Thomas Kleine Buening, Jiarui Gan, Debmalya Mandal, Marta Kwiatkowska

We show that RLHF is vulnerable to strategic manipulation, discuss the trade-off between incentive alignment and policy alignment, and propose an approximately strategyproof algorithm that addresses this vulnerability.