2 papers across 2 sessions
We present MedSG-Bench, the first benchmark for medical image sequence grounding, and introduce MedSG-188K and MedSeq-Grounder to facilitate future research in medical sequential grounding and reasoning.
ViGoRL is a vision-language model trained with reinforcement learning to ground each reasoning step in image coordinates, improving performance on spatial and web-based reasoning tasks through better attention and visual verification.