5 papers across 3 sessions
We propose an inference strategy for long-video QA that substantially improves the accuracy of a VLM by curating its input context.
A novel real-world benchmark for long-dependency understanding.