Full Professor, Polish Academy of Sciences
2 papers at NeurIPS 2025
We show that RLHF is vulnerable to strategic manipulation, discuss the trade-off between incentive alignment and policy alignment, and propose an approximately strategyproof algorithm that addresses this vulnerability.
We propose MIBP-Cert, a certified training method that uses mixed-integer bilinear programming to compute tight robustness guarantees under complex training-time perturbations. Our approach stabilizes training and improves certified accuracy.