3 papers across 2 sessions
We introduce MMTU, a new comprehensive benchmark designed to evaluate models' ability to understand, reason over, and manipulate diverse tables.
We introduce ALE-bench, a new benchmark for evaluating AI systems on score-based algorithmic programming contests.