This paper introduces SOPHIA, a simple and scalable semi-off-policy reinforcement learning method that enhances the ability of large vision-language models (LVLMs) to perform visual slow-thinking reasoning.