PhD student, School of Engineering and Applied Sciences, Harvard University
1 paper at NeurIPS 2025
We characterize reward hacking in inference-time alignment and show, across three inference-time methods, that hedging provably mitigates it.