6 papers across 3 sessions
We introduce SpatialReasoner, a novel large vision-language model that improves 3D spatial reasoning with explicit 3D representations and generalizable 3D thinking.
ADMN enables layer-wise allocation of resources among modalities at inference time according to modality quality for a fixed resource budget.
Aha! enables real-time, autoregressive highlight detection in continuous video using language tasks. Using a VLM and a Dynamic SinkCache for constant memory, it achieves SOTA on benchmarks & shows robotics potential.