Researcher, Microsoft
1 paper at NeurIPS 2025
We propose two techniques to improve the data efficiency of reinforcement learning (RL) fine-tuning for LLMs: difficulty-targeted online data selection and rollout replay.