2 papers across 2 sessions
We design the first efficient algorithm with near-optimal regret for contextual dueling bandits using offline oracles, enabling scalable preference-based learning in RLHF and resolving a key open problem in AI alignment.
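For readers unfamiliar with the setting, below is a minimal sketch of the generic contextual dueling bandit protocol reduced to an offline regression oracle. This illustrates the framework only, not the paper's algorithm: the linear feature map, logistic (Bradley-Terry) preference link, epsilon-greedy duel selection, and periodic refitting schedule are all illustrative assumptions.

```python
# Sketch of the contextual dueling bandit loop with an offline regression
# oracle. Everything here (feature map, link, exploration rule) is a toy
# assumption for illustration, NOT the paper's method.
import numpy as np

rng = np.random.default_rng(0)
d, K, T, eps = 5, 10, 2000, 0.1               # feature dim, arms, rounds, exploration rate
theta_star = rng.normal(size=d)               # unknown true utility parameters
ARM_FEATURES = rng.normal(size=(K, d))        # hypothetical per-arm feature directions

def phi(x, a):
    """Hypothetical context-action feature map."""
    return x * ARM_FEATURES[a]

def offline_oracle(X, y):
    """Offline least-squares regression oracle fit on logged duel outcomes."""
    return np.linalg.lstsq(X, y, rcond=None)[0]

logged_X, logged_y, theta_hat = [], [], np.zeros(d)
for t in range(T):
    x = rng.normal()                          # scalar context for simplicity
    utils = x * (ARM_FEATURES @ theta_hat)    # estimated utility of each arm
    # Duel the greedy arm against a random arm (explore) or the runner-up.
    a = int(np.argmax(utils))
    b = int(rng.integers(K)) if rng.random() < eps else int(np.argsort(utils)[-2])
    # Binary preference feedback via a Bradley-Terry / logistic link.
    gap = (phi(x, a) - phi(x, b)) @ theta_star
    o = rng.random() < 1.0 / (1.0 + np.exp(-gap))   # 1 iff arm a beats arm b
    # Log the duel as a regression sample; refit offline every 100 rounds.
    logged_X.append(phi(x, a) - phi(x, b))
    logged_y.append(1.0 if o else 0.0)
    if (t + 1) % 100 == 0:
        # Centering labels at 0 makes least squares target the utility gap.
        theta_hat = offline_oracle(np.array(logged_X), np.array(logged_y) - 0.5)
```

The point of the reduction is that all learning happens inside `offline_oracle`, a standard supervised regression call on logged data, so any scalable offline learner can be plugged in.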
We release an open, human-annotated preference dataset of 40K samples spanning General, STEM, Code, and Multilingual domains, which can be used to train reward models that achieve state-of-the-art results on RM-Bench and JudgeBench.
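As an illustration of how pairwise preference data like this is typically consumed, here is a minimal sketch of reward-model training with a Bradley-Terry (pairwise logistic) loss. The synthetic embeddings and tiny linear model are stand-ins under stated assumptions, not the released dataset's format or the actual training recipe.

```python
# Toy reward-model training on (chosen, rejected) preference pairs with a
# Bradley-Terry loss. Synthetic data and a linear model stand in for real
# response embeddings and a full reward model.
import numpy as np

rng = np.random.default_rng(0)
d, n, lr = 16, 1000, 0.5
# Stand-ins for embeddings of the preferred and dispreferred responses;
# "chosen" is shifted so it genuinely scores higher on average.
chosen = rng.normal(size=(n, d)) + 0.3
rejected = rng.normal(size=(n, d))

w = np.zeros(d)                               # linear reward model r(x) = w @ x
for step in range(200):
    margin = (chosen - rejected) @ w          # r(chosen) - r(rejected)
    p = 1.0 / (1.0 + np.exp(-margin))         # P(chosen preferred) under Bradley-Terry
    # Gradient of the mean negative log-likelihood -log p over all pairs.
    grad = -((1.0 - p)[:, None] * (chosen - rejected)).mean(axis=0)
    w -= lr * grad
accuracy = ((chosen - rejected) @ w > 0).mean()
print(f"pairwise accuracy on training pairs: {accuracy:.2f}")
```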