logo
today local_bar
Poster Session 5 · Friday, December 5, 2025 11:00 AM → 2:00 PM
#5518

Rollout Roulette: A Probabilistic Inference Approach to Inference-Time Scaling of LLMs using Particle-Based Monte Carlo Methods

NeurIPS OpenReview

Abstract

Large language models (LLMs) have achieved significant performance gains via scaling up model sizes and/or data. However, recent evidence suggests diminishing returns from such approaches, motivating a pivot to scaling test-time compute.
Existing deterministic inference-time scaling methods, usually with reward models, cast the task as a search problem, but suffer from a key limitation: early pruning. Due to inherently imperfect reward models, promising trajectories may be discarded prematurely, leading to suboptimal performance.
We propose a novel inference-time scaling approach by adapting particle-based Monte Carlo methods. Our method maintains a diverse set of candidates and robustly balances exploration and exploitation. Our empirical evaluation demonstrates that our particle filtering methods have a 4-16x better scaling rate over deterministic search counterparts on both various challenging mathematical and more general reasoning tasks.
Using our approach, we show that Qwen2.5-Math-1.5B-Instruct surpasses GPT-4o accuracy in only 4 rollouts, while Qwen2.5-Math-7B-Instruct scales to o1 level accuracy in only 32 rollouts. Our work not only presents an effective method to inference-time scaling, but also connects rich literature in probabilistic inference with inference-time scaling of LLMs to develop more robust algorithms in future work.