Poster Session 1 · Wednesday, December 3, 2025, 11:00 AM – 2:00 PM
#1508
Reason-RFT: Reinforcement Fine-Tuning for Visual Reasoning of Vision Language Models
Abstract
Visual reasoning abilities play a crucial role in understanding complex multimodal data, advancing both domain-specific applications and artificial general intelligence (AGI). Existing methods enhance Vision-Language Models (VLMs) through Chain-of-Thought (CoT) supervised fine-tuning using meticulously annotated data. However, this approach may lead to overfitting and cognitive rigidity, limiting the model’s generalization ability under domain shifts and reducing real-world applicability.
To overcome these limitations, we propose Reason-RFT, a two-stage reinforcement fine-tuning framework for visual reasoning. First, Supervised Fine-Tuning (SFT) with curated CoT data activates the reasoning potential of VLMs. This is followed by reinforcement learning based on Group Relative Policy Optimization (GRPO), which generates multiple reasoning-response pairs to enhance adaptability to domain shifts.
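For readers unfamiliar with GRPO, the core idea is to sample a group of responses per prompt, score each with a reward function, and normalize every reward against its own group so that responses above the group mean receive positive advantage. The sketch below illustrates only that group-relative advantage computation; the function names, the rule-based reward shaping, and the group size are illustrative assumptions for this abstract, not the authors' exact implementation.

```python
import torch

def group_relative_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """GRPO-style advantages: normalize each reward against its group.

    rewards: shape (num_prompts, group_size), one reward per sampled response.
    Returns (r - group_mean) / group_std, same shape.
    """
    mean = rewards.mean(dim=-1, keepdim=True)
    std = rewards.std(dim=-1, keepdim=True)
    return (rewards - mean) / (std + eps)

def reasoning_reward(response: str, answer: str) -> float:
    """Hypothetical rule-based reward: format bonus plus answer correctness."""
    has_format = "<answer>" in response and "</answer>" in response
    fmt = 0.5 if has_format else 0.0
    pred = response.split("<answer>")[-1].split("</answer>")[0].strip() if has_format else ""
    acc = 1.0 if pred == answer.strip() else 0.0
    return fmt + acc

# Toy example: 2 prompts, a group of 4 sampled responses each.
rewards = torch.tensor([[1.5, 0.5, 0.0, 1.5],
                        [0.0, 0.5, 0.5, 1.5]])
advantages = group_relative_advantages(rewards)
print(advantages)  # responses above their group mean get positive advantage
```

In a full GRPO update these advantages would weight a clipped policy-gradient objective with a KL penalty toward the SFT reference model; the sketch stops at the advantage step because that is the part specific to the group-relative formulation described above.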
To evaluate Reason-RFT, we reconstructed a comprehensive dataset covering visual counting, structural perception, and spatial transformation, serving as a benchmark for systematic assessment across these three task dimensions. Experimental results demonstrate three key advantages:
- performance enhancement, with Reason-RFT achieving state-of-the-art results and outperforming both open-source and proprietary models;
- generalization superiority, maintaining robust performance under domain shifts across various tasks; and
- data efficiency, excelling in few-shot learning scenarios and surpassing full-dataset SFT baselines.
Reason-RFT introduces a novel training paradigm for visual reasoning and marks a significant step forward in multimodal research.