4 papers across 3 sessions
We introduce an RL framework that trains an LLM's reasoning and self-verification abilities simultaneously.
How do models generalize to unseen constraints for precise instruction following?
We need only one example for RLVR on LLMs to achieve significant improvement on math tasks.