Undergraduate student, University of Science and Technology of China
1 paper at NeurIPS 2025
This work shows that greedy sampling based on empirical estimates is provably efficient for RLHF, under both the general preference model and the Bradley-Terry model.