We characterize reward hacking in inference-time alignment and show, using three inference-time methods, that hedging provably mitigates it.