Researcher, Microsoft Research
1 paper at NeurIPS 2025
We design the first efficient algorithm with near-optimal regret for contextual dueling bandits using offline oracles, enabling scalable preference-based learning in RLHF and resolving a key open problem in AI alignment.