Poster Session 6 · Friday, December 5, 2025 4:30 PM → 7:30 PM
#5206
Logic-in-Frames: Dynamic Keyframe Search via Visual Semantic-Logical Verification for Long Video Understanding
Abstract
Understanding long video content is a complex endeavor that often relies on densely sampled frame captions or end-to-end feature selectors, yet these techniques commonly overlook the logical relationships between textual queries and visual elements. In practice, computational constraints necessitate coarse frame subsampling, which makes locating the relevant frames analogous to “finding a needle in a haystack.” To address this issue, we introduce a semantics-driven search framework that reformulates keyframe selection under the paradigm of Visual Semantic-Logical Search (VSLS).
Specifically, we systematically define four fundamental logical dependencies:
- spatial co-occurrence,
- temporal proximity,
- attribute dependency, and
- causal order.
These relations dynamically update frame sampling distributions through an iterative refinement process, enabling context-aware identification of semantically critical frames tailored to specific query requirements.
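As a rough illustration of this kind of iterative refinement, the following minimal Python sketch samples frames from a distribution, scores each sampled frame against one logical relation (spatial co-occurrence only), and raises the sampling weight around frames that satisfy it. This is not the authors' implementation: the function names, the `spread` and `gain` parameters, the update rule, and the toy per-frame detection sets are all assumptions made for illustration, and a real system would obtain detections from an object detector rather than a lookup table.

```python
import random


def cooccurrence_score(detections, obj_a, obj_b):
    """Spatial co-occurrence: 1.0 if both objects appear in the same frame."""
    return 1.0 if obj_a in detections and obj_b in detections else 0.0


def refine_distribution(weights, scores, spread=2, gain=1.0):
    """Boost weights near positively scored frames, then renormalise.

    `scores` maps frame index -> relation score. The boost decays with
    distance from the scored frame (a hypothetical update rule).
    """
    n = len(weights)
    updated = list(weights)
    for idx, score in scores.items():
        if score <= 0:
            continue
        lo, hi = max(0, idx - spread), min(n, idx + spread + 1)
        for j in range(lo, hi):
            updated[j] += gain * score / (1 + abs(j - idx))
    total = sum(updated)
    return [w / total for w in updated]


def search_keyframes(frame_detections, obj_a, obj_b, rounds=5, k=4, seed=0):
    """Iteratively sample frames and sharpen the sampling distribution."""
    rng = random.Random(seed)
    n = len(frame_detections)
    weights = [1.0 / n] * n  # start from uniform sampling
    for _ in range(rounds):
        sampled = rng.choices(range(n), weights=weights, k=k)
        scores = {
            i: cooccurrence_score(frame_detections[i], obj_a, obj_b)
            for i in sampled
        }
        weights = refine_distribution(weights, scores)
    return weights


# Toy video: 20 frames, only frame 12 contains both queried objects.
toy_frames = [set() for _ in range(20)]
toy_frames[3] = {"dog"}
toy_frames[12] = {"dog", "ball"}
final_weights = search_keyframes(toy_frames, "dog", "ball")
```

In this sketch, the other three relations (temporal proximity, attribute dependency, causal order) would each need their own scoring function, and those for temporal proximity and causal order would have to compare pairs of frames rather than a single frame.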
Our method achieves state-of-the-art keyframe selection performance on the manually annotated benchmark. Furthermore, when applied to downstream video question-answering tasks, the proposed approach yields the largest performance gains over existing methods on LongVideoBench and Video-MME, validating its effectiveness in bridging the logical gap between textual queries and visual-temporal reasoning. The code will be publicly available.