Poster Session 4 · Thursday, December 4, 2025 4:30 PM → 7:30 PM
#4719
Retrv-R1: A Reasoning-Driven MLLM Framework for Universal and Efficient Multimodal Retrieval
Abstract
The success of DeepSeek-R1 demonstrates the immense potential of using reinforcement learning (RL) to enhance LLMs' reasoning capabilities. This paper introduces Retrv-R1, the first R1-style MLLM designed specifically for universal multimodal retrieval, which uses step-by-step reasoning to produce more accurate retrieval results.
We find that directly applying the methods of DeepSeek-R1 to retrieval tasks is not feasible, mainly due to:
- the high computational cost caused by the large token consumption required for multiple candidates with reasoning processes, and
- the instability and suboptimal results that arise when RL is applied directly to training for retrieval tasks.
To address these issues, Retrv-R1 introduces an information compression module with a details inspection mechanism, which improves computational efficiency by reducing the number of tokens while ensuring that critical information for challenging candidates is preserved. We also propose a new training paradigm: an activation stage that uses a retrieval-tailored synthetic CoT dataset for more effective optimization, followed by RL with a novel curriculum reward that improves both performance and efficiency.
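To make the token-saving argument concrete, the following is a minimal sketch of the accounting only, not the Retrv-R1 implementation. It assumes each candidate is normally represented by a small fixed number of compressed tokens and that only candidates flagged as challenging are expanded back to their full token sequence. The `Candidate` class, the `needs_inspection` flag, the `compressed_tokens` parameter, and every token count are hypothetical values chosen for illustration; the abstract does not specify them.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    name: str
    full_tokens: int        # hypothetical size of the uncompressed candidate
    needs_inspection: bool  # hypothetical flag marking a challenging candidate


def uncompressed_budget(candidates):
    """Tokens consumed when every candidate is passed in full."""
    return sum(c.full_tokens for c in candidates)


def compressed_budget(candidates, compressed_tokens=2):
    """Tokens consumed when candidates are compressed by default and
    only the flagged ones are expanded to their full sequence."""
    total = 0
    for c in candidates:
        if c.needs_inspection:
            total += c.full_tokens        # full detail kept for hard cases
        else:
            total += compressed_tokens    # assumed fixed compressed size
    return total


# Illustrative numbers only; they are not taken from the paper.
candidates = [
    Candidate("a", 300, False),
    Candidate("b", 300, True),
    Candidate("c", 300, False),
    Candidate("d", 300, False),
]
baseline = uncompressed_budget(candidates)
compact = compressed_budget(candidates)
print(baseline, compact)
```

Under these assumed numbers, the context shrinks because three of the four candidates cost a handful of tokens each, while the single flagged candidate keeps its full detail. The actual savings depend on how many candidates the model chooses to inspect.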
With these designs, Retrv-R1 achieves state-of-the-art (SOTA) performance, high efficiency, and strong generalization ability, as demonstrated by extensive experiments across multiple benchmarks and tasks.