A Theory for Worst-Case vs. Average-Case Guarantees for LLMs

Noga Amit, Shafi Goldwasser, Orr Paradise, Guy N. Rothblum

Trustworthy ML Interactive Proofs Computational Complexity Theory

Abstract

How can we trust the correctness of a learned model on a particular input of interest? Model accuracy is typically measured on average over a distribution of inputs, giving no guarantee for any fixed input.

This paper proposes a theoretically-founded solution to this problem: to train Self-Proving models that prove the correctness of their output to a verification algorithm V via an Interactive Proof. Self-Proving models satisfy that, with high probability over an input sampled from a given distribution, the model generates a correct output and successfully proves its correctness to V. The soundness property of V guarantees that, for every input, no model can convince V of the correctness of an incorrect output. Thus, a Self-Proving model proves correctness of most of its outputs, while all incorrect outputs (of any model) are detected by V.
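To make the average-case and worst-case guarantees concrete, here is a minimal sketch on a toy task chosen purely for illustration: computing a greatest common divisor, with Bezout coefficients serving as a one-message proof. The names `verify`, `honest_model`, and `count_accepted_cheats` are hypothetical, inputs are assumed to be positive integers, and the proof is non-interactive, which is only a special case of the interactive setting described above.

```python
from math import gcd


def verify(a, b, claimed, proof):
    """Toy verifier V: accept iff `proof` certifies that `claimed` is gcd(a, b).

    Assumes a and b are positive integers. V never computes the gcd itself;
    it only checks the certificate (x, y).
    """
    x, y = proof
    # `claimed` must be a common divisor of a and b ...
    divides_both = claimed > 0 and a % claimed == 0 and b % claimed == 0
    # ... and an integer combination of them, so every common divisor divides it.
    is_combination = a * x + b * y == claimed
    return divides_both and is_combination


def honest_model(a, b):
    """Stand-in for a Self-Proving model: returns an output and its proof.

    Extended Euclid yields g and (x, y) with a*x + b*y == g.
    """
    old_r, r = a, b
    old_x, x = 1, 0
    old_y, y = 0, 1
    while r != 0:
        q = old_r // r
        old_r, r = r, old_r - q * r
        old_x, x = x, old_x - q * x
        old_y, y = y, old_y - q * y
    return old_r, (old_x, old_y)


def count_accepted_cheats(a, b, bound=20):
    """Brute-force search for an accepted proof of a wrong output.

    Soundness predicts that the count is 0 for every input (a, b).
    """
    truth = gcd(a, b)
    accepted = 0
    for claimed in range(1, max(a, b) + 1):
        if claimed == truth:
            continue
        for x in range(-bound, bound + 1):
            for y in range(-bound, bound + 1):
                if verify(a, b, claimed, (x, y)):
                    accepted += 1
    return accepted


if __name__ == "__main__":
    g, proof = honest_model(12, 18)
    print(verify(12, 18, g, proof))       # the correct output is accepted
    print(count_accepted_cheats(12, 18))  # no wrong output is accepted
```

In this sketch the per-input guarantee comes entirely from `verify`: a model that errs on some input cannot get its wrong answer accepted, whatever proof it submits. How often a model's correct answers are accepted is a separate, average-case property of the model.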

We devise and analyze two generic methods for learning Self-Proving models: Transcript Learning (TL), which relies on access to transcripts of accepting interactions, and Reinforcement Learning from Verifier Feedback (RLVF), which trains a model by emulating interactions with the verifier.
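The difference between the two training signals can be sketched as follows. This is one possible reading of the descriptions above, not the authors' algorithms: it only shows where the data or reward would come from and omits the model updates entirely. All names (`collect_accepting_transcripts`, `verifier_reward`, `toy_verifier`, `trial_division`, `make_guessing_model`) and the toy factoring task are assumptions made for illustration.

```python
import random


def collect_accepting_transcripts(inputs, prover, verifier):
    """TL-style data: keep (input, transcript) pairs that the verifier accepts.

    A model would then be fit to these pairs by ordinary supervised learning.
    """
    data = []
    for x in inputs:
        transcript = prover(x)
        if verifier(x, transcript):
            data.append((x, transcript))
    return data


def verifier_reward(x, model, verifier):
    """RLVF-style signal: reward 1.0 iff the model's own transcript is accepted.

    No reference transcripts are needed, only the ability to run the verifier.
    """
    return 1.0 if verifier(x, model(x)) else 0.0


# --- Toy task (assumption): exhibit a nontrivial factorization of n ---

def toy_verifier(n, transcript):
    p, q = transcript
    return 1 < p < n and 1 < q < n and p * q == n


def trial_division(n):
    """Source of accepting transcripts for composite n."""
    for p in range(2, n):
        if n % p == 0:
            return (p, n // p)
    return (1, n)  # n is prime: the verifier rejects this transcript


def make_guessing_model(seed):
    """Untrained stand-in model that guesses a factor at random (needs n >= 3)."""
    rng = random.Random(seed)

    def model(n):
        p = rng.randrange(2, n)
        return (p, n // p)

    return model


if __name__ == "__main__":
    print(collect_accepting_transcripts([15, 7, 21], trial_division, toy_verifier))
    model = make_guessing_model(0)
    rewards = [verifier_reward(15, model, toy_verifier) for _ in range(100)]
    print(sum(rewards) / len(rewards))  # empirical acceptance rate of the guesser
```

The design point the sketch is meant to highlight: the TL-style function needs an external source of accepting transcripts, while the RLVF-style function needs only the model's own attempts and the verifier's accept/reject decision.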