We introduce a theoretically grounded distributional RL algorithm for LLM post-training that improves on prior work on both synthetic and mathematical reasoning tasks.