Researcher, Tencent
1 paper at NeurIPS 2025
We present Reward Dithering, a technique that improves reinforcement learning for large language models by adding random perturbations to reward signals, speeding up training and convergence while maintaining final performance.
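The core idea can be sketched minimally: perturb each scalar reward with zero-mean noise before the policy update, leaving the expected reward unchanged. This is an illustrative sketch only; the function name `dither_rewards` and the choice of Gaussian noise with scale `noise_scale` are assumptions, not details from the paper.

```python
import numpy as np

def dither_rewards(rewards, noise_scale=0.05, rng=None):
    """Add zero-mean Gaussian perturbations to a batch of reward signals.

    noise_scale controls the dithering magnitude. Because the noise is
    zero-mean, the expected reward is unchanged; the perturbation acts
    as stochastic regularization rather than a bias.
    """
    rng = np.random.default_rng() if rng is None else rng
    rewards = np.asarray(rewards, dtype=float)
    return rewards + rng.normal(0.0, noise_scale, size=rewards.shape)

# Example: dither a batch of scalar rewards from a reward model
rewards = [0.8, 0.2, 0.5, 0.9]
dithered = dither_rewards(rewards, noise_scale=0.05, rng=np.random.default_rng(0))
```

In practice the dithered rewards would replace the raw ones in whatever policy-gradient update the training loop uses.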