1 paper across 1 session
We introduce a challenging benchmark that evaluates LLMs' mathematical reasoning and code-writing abilities, and find that specialized models such as o1-mini outperform earlier models but still struggle overall.