logo
today local_bar
Poster Session 1 · Wednesday, December 3, 2025 11:00 AM → 2:00 PM
#1806 Spotlight

TimE: A Multi-level Benchmark for Temporal Reasoning of LLMs in Real-World Scenarios

NeurIPS Project Page Slides Poster OpenReview

Abstract

Temporal reasoning is pivotal for Large Language Models (LLMs) to comprehend the real world. However, existing works neglect the real-world challenges for temporal reasoning:
  1. intensive temporal information,
  2. fast-changing event dynamics, and
  3. complex temporal dependencies in social interactions.
To bridge this gap, we propose a multi-level benchmark TimE, designed for temporal reasoning in real-world scenarios. TimE consists of 38,522 QA pairs, covering 3 levels with 11 fine-grained sub-tasks. This benchmark encompasses 3 sub-datasets reflecting different real-world challenges: TimE-Wiki, TimE-News, and TimE-Dial.
We conduct extensive experiments on reasoning models and non-reasoning models. And we conducted an in-depth analysis of temporal reasoning performance across diverse real-world scenarios and tasks, and summarized the impact of test-time scaling on temporal reasoning capabilities.
Additionally, we release TimE-Lite, a human-annotated subset to foster future research and standardized evaluation in temporal reasoning.
Poster