1 paper across 1 session
We present VideoHallu, a benchmark of over 3,000 synthetic videos with expert-crafted, counterintuitive QA pairs. It evaluates whether multimodal large language models (MLLMs) can detect perceptually obvious abnormalities, which they often miss because of language priors.