2 papers across 2 sessions
We characterize reward hacking in inference-time alignment and demonstrate, via three inference-time methods, that hedging can be leveraged to provably mitigate it.