Poster Session 2 · Wednesday, December 3, 2025 4:30 PM → 7:30 PM
#1411

CLEVER: A Curated Benchmark for Formally Verified Code Generation

NeurIPS · Project Page · OpenReview

Abstract

We introduce CLEVER, a high-quality, manually curated benchmark of 161 problems for end-to-end verified code generation in Lean. Each problem consists of two tasks:
  1. the task of generating a specification that matches a held-out ground-truth specification, and
  2. the task of generating a Lean implementation that provably satisfies this specification.
Unlike prior benchmarks, CLEVER avoids test-case supervision, LLM-generated annotations, and specifications that leak implementation logic or admit vacuous solutions. All outputs are verified post-hoc by Lean's type checker to ensure machine-checkable correctness.
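The two tasks can be sketched in Lean as follows. This is an illustrative toy problem (the names `max_spec`, `myMax`, and `myMax_correct` are hypothetical, not drawn from the benchmark): a held-out ground-truth specification, a candidate implementation, and a proof obligation that Lean's type checker verifies.

```lean
-- Hypothetical problem: return the maximum of two integers.

-- Ground-truth specification (held out; the model must produce
-- a specification provably equivalent to this one).
def max_spec (a b r : Int) : Prop :=
  r ≥ a ∧ r ≥ b ∧ (r = a ∨ r = b)

-- Candidate implementation generated by the model.
def myMax (a b : Int) : Int :=
  if a ≥ b then a else b

-- Proof that the implementation satisfies the specification;
-- checking this theorem is what "end-to-end verified" means here.
theorem myMax_correct (a b : Int) : max_spec a b (myMax a b) := by
  unfold max_spec myMax
  split <;> omega
```

A solution counts only if both the specification-equivalence proof and the implementation-correctness proof are accepted by the type checker; no test suite is involved.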
We use CLEVER to evaluate several few-shot and agentic approaches built on state-of-the-art language models. All of these methods struggle to achieve full verification, establishing CLEVER as a challenging frontier benchmark for program synthesis and formal reasoning.
The benchmark is available on both GitHub and HuggingFace, and all of our evaluation code is available online.