Poster Session 2 · Wednesday, December 3, 2025 4:30 PM → 7:30 PM
#1411

CLEVER: A Curated Benchmark for Formally Verified Code Generation

NeurIPS · Project Page · OpenReview

Abstract

We introduce CLEVER, a high-quality, manually curated benchmark of 161 problems for end-to-end verified code generation in Lean. Each problem consists of two tasks:
  1. the task of generating a specification that matches a held-out ground-truth specification, and
  2. the task of generating a Lean implementation that provably satisfies this specification.
Unlike prior benchmarks, CLEVER avoids test-case supervision, LLM-generated annotations, and specifications that leak implementation logic or admit vacuous solutions. All outputs are verified post-hoc by Lean's type checker to ensure machine-checkable correctness.
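The two tasks can be sketched in Lean as follows. This is an illustrative toy problem (the names `max_spec`, `myMax`, and `myMax_correct` are hypothetical, not drawn from the benchmark): a held-out ground-truth specification, a candidate implementation, and a proof obligation that Lean's type checker verifies.

```lean
-- Hypothetical problem: return the maximum of two integers.

-- Ground-truth specification (held out; the model must produce
-- a specification provably equivalent to this one).
def max_spec (a b r : Int) : Prop :=
  r ≥ a ∧ r ≥ b ∧ (r = a ∨ r = b)

-- Candidate implementation generated by the model.
def myMax (a b : Int) : Int :=
  if a ≥ b then a else b

-- Proof that the implementation satisfies the specification;
-- checking this theorem is what "end-to-end verified" means here.
theorem myMax_correct (a b : Int) : max_spec a b (myMax a b) := by
  unfold max_spec myMax
  split <;> omega
```

A solution counts only if both the specification-equivalence proof and the implementation-correctness proof are accepted by the type checker; no test suite is involved.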
We use CLEVER to evaluate several few-shot and agentic approaches built on state-of-the-art language models. All of these methods struggle to achieve full verification, establishing CLEVER as a challenging frontier benchmark for program synthesis and formal reasoning.
The benchmark is available on both GitHub and HuggingFace, and all of our evaluation code is available online.