The Automated LLM Speedrunning Benchmark: Reproducing NanoGPT Improvements
#3313 · Bingchen Zhao, Despoina Magka, Minqi Jiang, Xian Li, Roberta Raileanu, Tatiana Shavrina, Jean-Christophe Gagnon-Audet, Kelvin Niu, Shagun Sodhani, Michael Shvartsman, Andrei Lupu, Alisia Lupidi, Karen Hambardzumyan, Martin Josifoski, Edan Toledo, Thomas Foster, Lucia Cipolina Kun, Derek Dunfield, Abhishek Charnalia, Alexander Miller, Oisin Mac Aodha, Jakob Foerster, Yoram Bachrach
We introduce the Automated LLM Speedrunning Benchmark to assess the ability of AI agents to reproduce LLM research.