3 papers across 2 sessions
ViMaR is a two-stage, value-guided inference framework that uses margin-based rewards to produce captions that are faster to generate, more accurate, and less prone to hallucination, enabling scalable, self-improving vision–language models.
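The margin idea can be illustrated with a toy sketch: score candidate captions with a value model, take the gap between the best and runner-up scores as a confidence margin, and flag low-margin cases for a second, finer pass. All names here (`toy_value_model`, `select_caption`, the threshold) are illustrative assumptions, not ViMaR's actual API.

```python
# Hypothetical sketch of two-stage, margin-based, value-guided selection.
# The scorer and threshold are toy stand-ins, not the paper's method.

def toy_value_model(caption: str) -> float:
    # Stand-in value model: prefer longer captions (purely for illustration).
    return float(len(caption))

def select_caption(candidates, value_model, margin_threshold=1.0):
    # Stage 1: score every candidate caption with the value model.
    scored = sorted(((value_model(c), c) for c in candidates), reverse=True)
    (best_score, best), (runner_score, _) = scored[0], scored[1]
    # Margin-based signal: the gap between the top two scores.
    margin = best_score - runner_score
    # Stage 2 (sketched): a small margin would trigger a refinement pass.
    needs_refinement = margin < margin_threshold
    return best, margin, needs_refinement
```

For example, `select_caption(["a dog", "a dog on grass", "a dog on green grass"], toy_value_model)` picks the longest caption with a margin of 6.0 and no refinement needed.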
FlashMo introduces a geometric factorized interpolant and frequency-sparse attention, enabling scalable and efficient 3D motion diffusion.
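The frequency-sparsity intuition can be shown with a toy example: transform a motion signal to the frequency domain, keep only the k strongest bins, and discard the rest. This is a hypothetical illustration of sparsifying in frequency space, not FlashMo's actual attention mechanism; the function name and API are assumptions.

```python
import numpy as np

# Toy sketch of frequency sparsification: retain only the k
# largest-magnitude frequency components of a 1D motion signal.
def frequency_sparsify(signal: np.ndarray, k: int) -> np.ndarray:
    spec = np.fft.rfft(signal)
    # Indices of the k strongest frequency bins.
    keep = np.argsort(np.abs(spec))[-k:]
    sparse = np.zeros_like(spec)
    sparse[keep] = spec[keep]
    # Reconstruct the signal from the sparse spectrum.
    return np.fft.irfft(sparse, n=len(signal))
```

A pure sinusoid occupies a single frequency bin, so keeping just one bin (`k=1`) reconstructs it almost exactly, while smooth motion trajectories are similarly dominated by a few low frequencies.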