Latent Chain-of-Thought for Visual Reasoning

Guohao Sun, Hang Hua, Jian Wang, Jiebo Luo, Sohail Dianat, MAJID RABBANI, Raghuveer Rao, Zhiqiang Tao

Rochester Institute of Technology· Snap Inc.· University of Rochester· DEVCOM Army Research Laboratory

vision-language model reasoning variational inference

Abstract

Chain-of-thought (CoT) reasoning is critical for improving the interpretability and reliability of Large Vision-Language Models (LVLMs). However, existing training algorithms such as SFT, PPO, and GRPO may not generalize well across unseen reasoning tasks and heavily rely on a biased reward model.

To address this challenge, we reformulate reasoning in LVLMs as posterior inference and propose a scalable training algorithm based on amortized variational inference.

By leveraging diversity-seeking reinforcement learning algorithms, we introduce a novel sparse reward function for token-level learning signals that encourage diverse, high-likelihood latent CoT, overcoming deterministic sampling limitations and avoiding reward hacking. Additionally, we implement a Bayesian inference-scaling strategy that replaces costly Best-of-N and Beam Search with a marginal likelihood to efficiently rank optimal rationales and answers.

We empirically demonstrate that the proposed method enhances the state-of-the-art LVLMs on four reasoning benchmarks, in terms of effectiveness, generalization, and interpretability.