2 papers across 2 sessions
We introduce a challenging benchmark for evaluating the mathematical reasoning and code-writing abilities of LLMs. We find that specialized models such as o1-mini outperform earlier models but still struggle overall.