1 paper across 1 session
We propose a general reinforcement learning framework tailored to interleaved multimodal tasks. It permutes image sequences to simulate varied positional relationships, exposing the model to greater spatial and positional diversity.
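As a rough illustration of the permutation idea, the sketch below generates several distinct orderings of one sample's image sequence and keeps track of where each image ends up. It is only a minimal sketch under stated assumptions, not the paper's implementation: the function names `permute_images` and `remap_answer`, the `num_variants` and `seed` parameters, and the idea of remapping a position-based answer are all hypothetical and do not come from the source, which does not describe how permuted sequences are used within the reinforcement learning loop.

```python
import random
from typing import List, Sequence, Tuple


def permute_images(
    images: Sequence[str],
    num_variants: int,
    seed: int = 0,
) -> List[Tuple[List[str], List[int]]]:
    """Return up to `num_variants` distinct orderings of `images`.

    Each entry is (permuted_images, order), where order[new_pos] = old_pos.
    Hypothetical helper: the source does not specify how orderings are sampled.
    """
    rng = random.Random(seed)
    seen = set()
    variants = []
    # Cap the attempts so the loop ends even when few distinct orderings exist.
    for _ in range(num_variants * 20):
        if len(variants) == num_variants:
            break
        order = list(range(len(images)))
        rng.shuffle(order)
        key = tuple(order)
        if key in seen:
            continue
        seen.add(key)
        variants.append(([images[i] for i in order], order))
    return variants


def remap_answer(old_index: int, order: Sequence[int]) -> int:
    """Map an answer given as an image position to its position after permutation.

    Assumption: the task's answer refers to an image by its position.
    """
    return order.index(old_index)
```

For example, calling `permute_images(["a", "b", "c"], 3)` yields three reorderings of the same images, and `remap_answer(0, order)` gives the new position of the image that was originally first, so a position-based label stays consistent across variants.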