We introduce ROVER, a recursive framework that improves the video reasoning accuracy and efficiency of vision-language models in embodied settings.