5 papers across 3 sessions
We show that RLHF is vulnerable to strategic manipulation, discuss the trade-offs between incentive alignment and policy alignment, and propose an approximately strategyproof algorithm that mitigates this vulnerability.
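The summary doesn't spell out the algorithm, so the sketch below only illustrates the failure mode and one classic approximately strategyproof aggregator, the coordinate-wise median (Moulin, 1980); the reward reports, the single exaggerating labeler, and the aggregation step are illustrative assumptions, not the paper's method.

```python
import numpy as np

def aggregate_mean(reports: np.ndarray) -> np.ndarray:
    # Mean aggregation: one exaggerated report shifts the result by
    # delta / n, so a labeler's influence grows with the size of the lie.
    return reports.mean(axis=0)

def aggregate_median(reports: np.ndarray) -> np.ndarray:
    # Coordinate-wise median: strategyproof in one dimension under
    # single-peaked preferences -- a labeler cannot push the output past
    # the neighboring honest reports, however much they exaggerate.
    return np.median(reports, axis=0)

rng = np.random.default_rng(0)
# Nine honest labelers report rewards for three candidate responses.
honest = rng.normal(loc=[1.0, 0.0, -1.0], scale=0.1, size=(9, 3))
# One strategic labeler inverts and wildly exaggerates their report.
liar = np.array([[-50.0, 0.0, 50.0]])
reports = np.vstack([honest, liar])

print("mean  :", aggregate_mean(reports))    # dragged toward the liar
print("median:", aggregate_median(reports))  # close to the honest consensus
```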
Theoretical inconsistencies among Responsible AI metrics aren't a problem but a benefit: they enable pluralistic approaches to alignment that respect diverse values, aid conceptual understanding, and create more robust, adaptable models.
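The paper's specific metrics aren't reproduced here, but a small worked example shows the kind of inconsistency at stake (the two groups, their base rates, and the choice of demographic parity versus equalized odds are illustrative assumptions). With unequal base rates, even a perfect classifier satisfies equalized odds while violating demographic parity, so no single model can satisfy both metrics at once:

```python
import numpy as np

rng = np.random.default_rng(1)

# Two groups with different base rates of the positive label.
y_a = rng.binomial(1, 0.5, size=10_000)  # group A, base rate 0.5
y_b = rng.binomial(1, 0.2, size=10_000)  # group B, base rate 0.2

# A perfect classifier: predictions equal the true labels.
pred_a, pred_b = y_a.copy(), y_b.copy()

def rates(y, pred):
    tpr = pred[y == 1].mean()  # true positive rate
    fpr = pred[y == 0].mean()  # false positive rate
    return pred.mean(), tpr, fpr

sel_a, tpr_a, fpr_a = rates(y_a, pred_a)
sel_b, tpr_b, fpr_b = rates(y_b, pred_b)

# Equalized odds holds: TPR and FPR match across groups...
print(f"TPR: {tpr_a:.2f} vs {tpr_b:.2f}   FPR: {fpr_a:.2f} vs {fpr_b:.2f}")
# ...but demographic parity fails: selection rates track the base rates.
print(f"selection rate: {sel_a:.2f} vs {sel_b:.2f}")
```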