We propose Rectified Policy Optimization (RePO), which mitigates "safety compensation" by replacing the average safety metric with stricter safety constraints.
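To illustrate why an averaged safety metric can hide unsafe behavior, here is a minimal sketch. It is not the RePO algorithm itself: the cost values, the threshold, and the per-sample form of the stricter check are all assumptions chosen for illustration.

```python
# Hypothetical per-response safety costs (higher = less safe); values are illustrative only.
costs = [0.0, 0.1, 0.0, 0.9]
threshold = 0.3  # assumed safety budget, not taken from the paper

# Average-based check: very safe responses can offset an unsafe one.
mean_cost = sum(costs) / len(costs)
average_ok = mean_cost <= threshold

# Stricter check (assumed per-sample form): every response must stay within the budget.
violations = [c for c in costs if c > threshold]
strict_ok = not violations

print(f"mean cost = {mean_cost:.2f}, passes average check: {average_ok}")
print(f"violations = {violations}, passes strict check: {strict_ok}")
```

In this toy case the average check passes even though one response is clearly unsafe, which is the kind of compensation a stricter constraint is meant to rule out.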