We recast offline RL as reward-weighted fine-tuning, enabling practical RL optimization of LLM agents using only standard SFT infrastructure.
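A minimal sketch of what a reward-weighted fine-tuning objective can look like: each trajectory's SFT (negative log-likelihood) loss is scaled by a reward-derived weight, so higher-reward behavior is imitated more strongly. The exponentiated, softmax-normalized weighting (in the style of advantage-weighted regression), the temperature `beta`, and the function name are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def reward_weighted_sft_loss(token_logps, rewards, beta=1.0):
    """Reward-weighted SFT loss over a batch of trajectories.

    token_logps: list of per-token log-probabilities, one array per trajectory.
    rewards: scalar reward per trajectory.
    beta: temperature; smaller values concentrate weight on top trajectories.
    """
    rewards = np.asarray(rewards, dtype=float)
    # Softmax over (shifted) rewards gives normalized trajectory weights.
    w = np.exp((rewards - rewards.max()) / beta)
    w /= w.sum()
    # Per-trajectory NLL: negative mean token log-probability.
    nlls = np.array([-np.mean(lp) for lp in token_logps])
    # Weighted SFT objective: plain SFT recovered when all rewards are equal.
    return float(np.sum(w * nlls))
```

With equal rewards this reduces to ordinary SFT (a uniform average of trajectory losses); as rewards separate, the loss is dominated by the highest-reward trajectories, which is what lets a supervised training loop act as an offline RL update.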