Research Scientist, DeepMind
2 papers at NeurIPS 2025
Scaling RL-based reasoning to unverifiable data
We give a theoretical analysis of the role an additive reward correction plays in improving performance in off-policy RL, accompanied by experiments on bandits and LLM post-training.