Poster Session 1 · Wednesday, December 3, 2025 11:00 AM → 2:00 PM
#512

Enhancing the Outcome Reward-based RL Training of MLLMs with Self-Consistency Sampling


Abstract

Outcome-reward reinforcement learning (RL) is a common way to refine the step-by-step reasoning of multimodal large language models (MLLMs). In the multiple-choice setting, a dominant format for multimodal reasoning benchmarks, the paradigm faces an often-overlooked obstacle: unfaithful trajectories that guess the correct option after a faulty chain of thought receive the same reward as genuine reasoning.
We propose Self-Consistency Sampling (SCS) to correct this issue. For each question, SCS (i) introduces small visual perturbations and (ii) performs repeated truncation-and-resampling of a reference trajectory; agreement among the resulting trajectories yields a differentiable consistency score that down-weights unreliable traces during policy updates.
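The abstract does not spell out the exact scoring rule, so the following is a minimal Python sketch of the sampling-and-agreement idea only. The hard agreement rate and the multiplicative down-weighting are illustrative assumptions; the paper's actual score is differentiable, which a hard vote is not.

```python
from collections import Counter

def consistency_score(answers: list[str]) -> float:
    """Agreement rate: fraction of resampled trajectories that land on
    the majority option. Hypothetical hard-vote stand-in for the paper's
    differentiable consistency score."""
    if not answers:
        return 0.0
    _, majority_count = Counter(answers).most_common(1)[0]
    return majority_count / len(answers)

def scs_reward(outcome_reward: float, resampled_answers: list[str]) -> float:
    """Down-weight the binary outcome reward by trajectory agreement,
    so a lucky guess after inconsistent reasoning earns less credit."""
    return outcome_reward * consistency_score(resampled_answers)

# Example: the reference trajectory answered "B" correctly, but resamples
# from truncated prefixes (under perturbed images) mostly disagree,
# so the reward is discounted.
print(scs_reward(1.0, ["B", "D", "A", "B", "C"]))  # 1.0 * 0.4 = 0.4
```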
Plugging SCS into RLOO, GRPO, and the REINFORCE++ series improves accuracy by up to 7.7 percentage points on six multimodal benchmarks with negligible extra computation, offering a simple, general remedy for outcome-reward RL in MLLMs.
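For concreteness, here is one hypothetical way such a consistency weight could enter a GRPO-style update: each rollout's outcome reward is scaled by its consistency score before the usual group normalization. The group-mean baseline is standard GRPO; the scaling step is an assumption, not the paper's published formulation.

```python
def grpo_advantages(rewards: list[float], scores: list[float]) -> list[float]:
    """Scale each rollout's outcome reward by its consistency score,
    then compute group-normalized advantages as in GRPO."""
    weighted = [r * s for r, s in zip(rewards, scores)]
    mean = sum(weighted) / len(weighted)
    var = sum((w - mean) ** 2 for w in weighted) / len(weighted)
    std = var ** 0.5
    return [(w - mean) / (std + 1e-8) for w in weighted]

# Two correct rollouts: one consistent (score 1.0), one a lucky guess (0.2).
# The lucky guess receives a much smaller (here negative) advantage.
print(grpo_advantages([1.0, 1.0, 0.0, 0.0], [1.0, 0.2, 0.8, 0.5]))
```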