5 papers across 2 sessions
We present the first multimodal (vision, audio, digital context, longitudinal) scripted dataset and benchmark for goal inference by wearable assistant agents.
We address the challenge of predicting where a person will look next in their three-dimensional environment.
We introduce IndEgo, a multimodal dataset of industrial tasks that enables research on collaboration, task understanding, and mistake detection.