3 papers across 2 sessions
We propose a dataset of 3D full-body poses and cooking videos, along with benchmarks for multimodal behavior understanding.
We present a unified audio-visual framework for uncovering how humans and AI models respond to modality conflicts and bias.