Scientist, Los Alamos National Laboratory
1 paper at NeurIPS 2025
We introduce a challenging benchmark for evaluating LLMs' mathematical reasoning and code-writing abilities, and find that specialized models such as o1-mini outperform earlier models but still struggle overall.