Postdoc, Harvard University
2 papers at NeurIPS 2025
We characterize reward hacking in inference-time alignment and show, across three inference-time methods, that hedging provably mitigates it.
We use coding theory and heavy tails to design powerful watermarks with zero distortion.