1 paper across 1 session
We propose Hölder-DPO, the first alignment method with a provable redescending property. This property enables robust learning from noisy human feedback: the method identifies and corrects mislabeled data, which improves alignment and model performance.
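
To make the term "redescending" concrete, here is a toy sketch. It does not implement the Hölder-DPO loss, which the summary above does not spell out. It assumes a scalar preference margin z (how strongly the model agrees with the recorded label) and compares the gradient weight of the standard logistic loss -log sigmoid(z) with that of a hypothetical stand-in power-type loss -sigmoid(z)**gamma / gamma. The function names and the choice gamma = 2 are illustrative assumptions. A loss is redescending when the weight it gives an example falls back toward zero as the example increasingly contradicts the model, which is how a confidently mislabeled pair would look.

```python
import math


def sigmoid(z: float) -> float:
    """Logistic function; fine for the moderate margins used here."""
    return 1.0 / (1.0 + math.exp(-z))


def logistic_grad_weight(z: float) -> float:
    """|d/dz| of -log(sigmoid(z)), which equals sigmoid(-z).

    Tends to 1 as z -> -inf, so extreme outliers keep full influence.
    """
    return sigmoid(-z)


def power_grad_weight(z: float, gamma: float = 2.0) -> float:
    """|d/dz| of -(sigmoid(z) ** gamma) / gamma (illustrative stand-in loss).

    Equals sigmoid(z)**gamma * (1 - sigmoid(z)), which tends to 0 as
    z -> -inf: the influence of extreme outliers redescends to zero.
    """
    s = sigmoid(z)
    return s ** gamma * (1.0 - s)


if __name__ == "__main__":
    # Negative margins mimic pairs whose label contradicts the model.
    for z in (-20.0, -5.0, -1.0, 0.0, 1.0, 5.0):
        print(f"z={z:6.1f}  logistic={logistic_grad_weight(z):.3e}  "
              f"power={power_grad_weight(z):.3e}")
```

Under these assumptions, the logistic weight saturates near 1 for very negative margins, while the power-type weight peaks at a moderate margin and then shrinks toward 0, so a grossly mislabeled pair contributes almost nothing to the update.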