We propose two techniques to improve the data efficiency of LLM RL fine-tuning: difficulty-targeted online data selection and rollout replay.
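As a rough illustration of what these two ideas could look like in code, the sketch below pairs a difficulty-targeted selector with a small rollout buffer. The text above does not specify either mechanism, so everything here is an assumption made for illustration: the selector assumes difficulty is measured by a per-prompt estimated success rate and that prompts closest to a target rate (0.5 by default) are preferred, and the buffer assumes a simple first-in-first-out policy with uniform sampling. The names `select_by_difficulty` and `RolloutReplayBuffer` are hypothetical.

```python
import random
from collections import deque


def select_by_difficulty(success_rates, k, target=0.5):
    """Pick the k prompt ids whose estimated success rate is closest to `target`.

    `success_rates` maps prompt id -> estimated probability that the current
    policy solves the prompt. The 0.5 default target is an assumption of this
    sketch, not a value taken from the source.
    """
    # Sort by distance to the target; break ties by id for determinism.
    ranked = sorted(
        success_rates,
        key=lambda pid: (abs(success_rates[pid] - target), pid),
    )
    return ranked[:k]


class RolloutReplayBuffer:
    """Fixed-capacity FIFO store of recent rollouts (assumed design)."""

    def __init__(self, capacity):
        # deque with maxlen silently evicts the oldest entries when full.
        self._buffer = deque(maxlen=capacity)

    def add(self, rollouts):
        """Append new rollouts, evicting the oldest ones beyond capacity."""
        self._buffer.extend(rollouts)

    def sample(self, n, rng=random):
        """Return up to n stored rollouts, drawn uniformly without replacement."""
        n = min(n, len(self._buffer))
        return rng.sample(list(self._buffer), n)

    def __len__(self):
        return len(self._buffer)
```

In a training loop, one might call `select_by_difficulty` each step to choose which prompts to roll out, then mix fresh rollouts with a sample from the buffer before the policy update. How the success rates are estimated and how replayed rollouts are weighted are left open here.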