3 papers across 2 sessions
SAGE‑Eval is the first benchmark to test whether frontier LLMs robustly generalize critical safety knowledge to novel situations; we show that the strongest model we tested passed only 58% of the safety facts evaluated.
Reasoning models can learn rules from simple examples and then apply those rules to solve complex ones. We identify a broad class of everyday reasoning rules that current models cannot learn, and we build large datasets that require such rule learning.