Poster Session 1 · Wednesday, December 3, 2025, 11:00 AM – 2:00 PM
#1508
Reason-RFT: Reinforcement Fine-Tuning for Visual Reasoning of Vision Language Models
Abstract
Visual reasoning abilities play a crucial role in understanding complex multimodal data, advancing both domain-specific applications and artificial general intelligence (AGI). Existing methods enhance Vision-Language Models (VLMs) through Chain-of-Thought (CoT) supervised fine-tuning using meticulously annotated data. However, this approach may lead to overfitting and cognitive rigidity, limiting the model’s generalization ability under domain shifts and reducing real-world applicability.
To overcome these limitations, we propose Reason-RFT, a two-stage reinforcement fine-tuning framework for visual reasoning. First, Supervised Fine-Tuning (SFT) with curated CoT data activates the reasoning potential of VLMs. This is followed by reinforcement learning based on Group Relative Policy Optimization (GRPO), which generates multiple reasoning-response pairs to enhance adaptability to domain shifts.
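For readers unfamiliar with GRPO, the core idea is to sample a group of responses per prompt, score each with a reward function, and normalize every reward against its own group so that responses above the group mean receive positive advantage. The sketch below illustrates only that group-relative advantage computation; the function names, the rule-based reward shaping, and the group size are illustrative assumptions for this abstract, not the authors' exact implementation.

```python
import torch

def group_relative_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """GRPO-style advantages: normalize each reward against its group.

    rewards: shape (num_prompts, group_size), one reward per sampled response.
    Returns (r - group_mean) / group_std, same shape.
    """
    mean = rewards.mean(dim=-1, keepdim=True)
    std = rewards.std(dim=-1, keepdim=True)
    return (rewards - mean) / (std + eps)

def reasoning_reward(response: str, answer: str) -> float:
    """Hypothetical rule-based reward: format bonus plus answer correctness."""
    has_format = "<answer>" in response and "</answer>" in response
    fmt = 0.5 if has_format else 0.0
    pred = response.split("<answer>")[-1].split("</answer>")[0].strip() if has_format else ""
    acc = 1.0 if pred == answer.strip() else 0.0
    return fmt + acc

# Toy example: 2 prompts, a group of 4 sampled responses each.
rewards = torch.tensor([[1.5, 0.5, 0.0, 1.5],
                        [0.0, 0.5, 0.5, 1.5]])
advantages = group_relative_advantages(rewards)
print(advantages)  # responses above their group mean get positive advantage
```

In a full GRPO update these advantages would weight a clipped policy-gradient objective with a KL penalty toward the SFT reference model; the sketch stops at the advantage step because that is the part specific to the group-relative formulation described above.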
To evaluate Reason-RFT, we reconstructed a comprehensive dataset covering visual counting, structural perception, and spatial transformation, serving as a benchmark for systematic assessment across these three task dimensions. Experimental results demonstrate three key advantages:
- performance enhancement, with Reason-RFT achieving state-of-the-art results and outperforming both open-source and proprietary models;
- generalization superiority, maintaining robust performance under domain shifts across various tasks; and
- data efficiency, excelling in few-shot learning scenarios and surpassing full-dataset SFT baselines.
Reason-RFT introduces a novel training paradigm for visual reasoning and marks a significant step forward in multimodal research.