logo
today local_bar
Poster Session 1 · Wednesday, December 3, 2025 11:00 AM → 2:00 PM
#4816 Spotlight

JavisGPT: A Unified Multi-modal LLM for Sounding-Video Comprehension and Generation

NeurIPS OpenReview

Abstract

This paper presents JavisGPT, the first unified multimodal large language model (MLLM) for Joint Audio-Video (JAV) comprehension and generation. JavisGPT adopts a concise encoder–LLM–decoder architecture, featuring a SyncFusion module for spatio-temporal audio-video fusion and synchrony-aware learnable queries to bridge a pretrained JAV-DiT generator. This design enables temporally coherent video-audio understanding and generation from multimodal instructions.
We design an effective three-stage training pipeline consisting ofmultimodal pretraining, audio-video fine-tuning, and large-scale instruction-tuning, to progressively build multimodal comprehension and generation from existing vision-language models.
To support this, we further construct JavisInst-Omni, a high-quality instruction datasetwith over 200K GPT-4o-curated audio-video-text dialogues that span diverse and multi-level comprehension and generation scenarios.
Extensive experiments on JAV comprehension and generation benchmarks show that JavisGPT outperforms existing MLLMs, particularly in complex and temporally synchronized settings.