We propose an inference strategy for long-video QA that substantially improves the accuracy of a VLM by curating its input context.
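As a rough illustration of what curating a VLM's input context could look like, here is a minimal Python sketch that selects the frames most relevant to the question before querying the model. Everything in it is an assumption for illustration: the embedding-similarity scoring, the `curate_context` function, and the `top_k` parameter are hypothetical and not taken from the paper.

```python
# Hypothetical sketch of input-context curation for long-video QA.
# Assumption: frames and the question are embedded in a shared space
# (e.g., by a CLIP-style encoder) and scored by cosine similarity.
from typing import List

import numpy as np


def curate_context(frame_embeddings: np.ndarray,
                   question_embedding: np.ndarray,
                   top_k: int = 16) -> List[int]:
    """Return indices of the top_k frames most similar to the question.

    frame_embeddings: shape (num_frames, dim), L2-normalized rows.
    question_embedding: shape (dim,), L2-normalized.
    """
    # For normalized vectors, cosine similarity is a dot product.
    scores = frame_embeddings @ question_embedding
    # Keep the highest-scoring frames, restored to temporal order
    # so the VLM still sees the video in its original sequence.
    top = np.argsort(scores)[-top_k:]
    return sorted(top.tolist())


# Usage (vlm and video are placeholders for a real model and frame list):
#   selected = curate_context(frame_embs, question_emb, top_k=16)
#   answer = vlm.generate(question, frames=[video[i] for i in selected])
```

The design choice illustrated here is that a small, question-conditioned subset of frames replaces the full video as the model's context, which is one common way such curation is realized; the paper's actual selection mechanism may differ.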