2 papers across 2 sessions
An importance-sampling-based method to mitigate over-optimization in Direct Alignment Algorithms for language models.
GradSPO reinterprets Stepwise Preference Optimization (SPO) through a novel gradient guidance lens, enabling a simplified objective and integrated noise reduction to achieve superior human preference alignment in text-to-image models.