We design the first efficient algorithm with near-optimal regret for contextual dueling bandits using offline oracles, enabling scalable preference-based learning in RLHF and resolving a key open problem in AI alignment.
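To make the setting concrete, here is a minimal sketch of the contextual dueling bandit protocol the sentence refers to — not the paper's algorithm. It uses a naive explore-then-exploit scheme in which the "offline oracle" is a single least-squares fit on logged preference data; the linear utility model, the feature map, and all names below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
d, T_explore = 3, 300
theta_star = rng.normal(size=d)          # hidden preference parameters (assumed linear model)

def features(x, a):
    # Joint context-action feature map (an assumption for this sketch).
    return x * a

def duel(x, a_i, a_j):
    # Environment feedback: returns True if a_i beats a_j, with a
    # logistic preference probability in the feature difference.
    z = features(x, a_i) - features(x, a_j)
    p = 1.0 / (1.0 + np.exp(-(theta_star @ z)))
    return rng.random() < p

# Phase 1: log random duels (context, action pair, binary preference).
Z, Y = [], []
for _ in range(T_explore):
    x = rng.normal(size=d)               # fresh context each round
    a_i, a_j = rng.normal(size=(2, d))   # random action pair
    y = 1.0 if duel(x, a_i, a_j) else -1.0
    Z.append(features(x, a_i) - features(x, a_j))
    Y.append(y)

# Phase 2: the offline oracle — one batch least-squares regression
# of preference outcomes on feature differences.
theta_hat, *_ = np.linalg.lstsq(np.array(Z), np.array(Y), rcond=None)

# The fitted direction aligns with the true preference parameters,
# so estimated utilities can rank actions for any new context.
cos = theta_hat @ theta_star / (np.linalg.norm(theta_hat) * np.linalg.norm(theta_star))
print(round(float(cos), 2))
```

An actual near-optimal-regret method would interleave exploration and exploitation rather than split them into phases; the split here only isolates the role the offline regression oracle plays in reducing online preference learning to batch supervised learning.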