2 papers across 2 sessions
This paper introduces CERES, which uses dual-modal causal intervention (backdoor adjustment for language bias and front-door adjustment with vision-depth fusion for visual bias) to achieve robust, state-of-the-art egocentric referring video object segmentation.
Motion4D estimates consistent motion and semantics by integrating 2D models with explicit 4D reconstruction, outperforming SAM2 and state-of-the-art 2D tracking networks.