5 papers across 3 sessions
Through theoretical models and empirical testbeds, we characterize the algorithmic tradeoff between privileged expert distillation and RL, and identify better options for expert distillation.
We introduce a new online RLHF algorithm that, for the first time, achieves a sample complexity scaling polynomially with the reward scale.