Poster Session 6 · Friday, December 5, 2025, 4:30 PM – 7:30 PM
#1912
Know What You Don't Know: Uncertainty Calibration of Process Reward Models
Abstract
Process reward models (PRMs) play a central role in guiding inference-time scaling algorithms for large language models (LLMs). However, we observe that even state-of-the-art PRMs can be poorly calibrated. Specifically, they tend to overestimate the probability that a partial reasoning trajectory will lead to a correct final answer, particularly when smaller LLMs are used to complete the trajectory.
To address this, we present a calibration approach, performed via quantile regression, that adjusts PRM outputs to better align with true success probabilities. Leveraging these calibrated success estimates and their associated confidence bounds, we introduce an instance-adaptive scaling (IAS) framework that dynamically adjusts the compute budget based on the estimated likelihood that a partial reasoning trajectory will yield a correct final answer.
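The abstract gives no implementation details, but the calibration step can be illustrated with a minimal sketch. Everything below is assumed for illustration only: the calibration data (raw PRM scores paired with Monte Carlo success rates estimated by rolling out completions from each partial trajectory), the toy overestimation relationship, and the choice of scikit-learn's gradient-boosted quantile regression are hypothetical, not the paper's released code.

```python
"""Minimal sketch: calibrating raw PRM scores with quantile regression."""
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)

# Hypothetical calibration set: raw PRM scores in [0, 1] paired with
# Monte Carlo success rates (fraction of rollouts from that partial
# trajectory that reach the correct final answer). The quadratic toy
# relationship makes the true rate sit below the raw score, mimicking
# the overestimation the abstract describes.
raw_scores = rng.uniform(0.0, 1.0, size=2000)
success_rates = np.clip(raw_scores**2 + rng.normal(0.0, 0.1, 2000), 0.0, 1.0)

# One regressor per quantile: the median serves as the calibrated success
# estimate, the 10th/90th percentiles as confidence bounds.
quantiles = {"lower": 0.1, "median": 0.5, "upper": 0.9}
models = {}
for name, q in quantiles.items():
    m = GradientBoostingRegressor(loss="quantile", alpha=q, n_estimators=100)
    m.fit(raw_scores.reshape(-1, 1), success_rates)
    models[name] = m

def calibrated_success(prm_score: float) -> dict:
    """Map a raw PRM score to a calibrated estimate with confidence bounds."""
    x = np.array([[prm_score]])
    return {name: float(m.predict(x)[0]) for name, m in models.items()}

print(calibrated_success(0.8))  # e.g. {'lower': ..., 'median': ..., 'upper': ...}
```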
Unlike conventional methods that allocate a fixed number of reasoning trajectories per query, IAS adapts the compute budget to each instance and reasoning step, guided by the calibrated PRM estimates.
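Given a calibrated lower confidence bound on the per-trajectory success probability, one plausible allocation rule (our assumption, not necessarily the paper's exact formula) samples just enough independent trajectories that at least one succeeds with a target probability:

```python
import math

def ias_budget(p_lower: float, target: float = 0.95, n_max: int = 64) -> int:
    """Smallest n such that 1 - (1 - p)^n >= target, under independence.

    `p_lower` is the calibrated lower confidence bound on a single
    trajectory's success probability; using the lower bound is a
    conservative choice. `target` and `n_max` are hypothetical settings.
    """
    p = min(max(p_lower, 1e-6), 1.0 - 1e-6)  # keep the logs finite
    n = math.ceil(math.log(1.0 - target) / math.log(1.0 - p))
    return max(1, min(n, n_max))

# Confident instances get few samples; uncertain ones get more compute.
print(ias_budget(0.90))  # 2
print(ias_budget(0.05))  # 59
```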
Experiments on mathematical reasoning benchmarks show that:
- our PRM calibration method achieves low calibration error, outperforming baseline methods,
- calibration is crucial for enabling effective IAS, and
- the proposed IAS strategy reduces inference cost while maintaining final-answer accuracy, spending less compute on problems where the model is more confident, as desired.