Poster Session 4 · Thursday, December 4, 2025 4:30 PM → 7:30 PM
#1903 Spotlight
Audio Flamingo 3: Advancing Audio Intelligence with Fully Open Large Audio Language Models
Abstract
We present Audio Flamingo 3 (AF3), a fully open state-of-the-art (SOTA) large audio-language model that advances reasoning and understanding across speech, sound, and music.
AF3 introduces:
- AF-Whisper, a unified audio encoder trained using a novel strategy for joint representation learning across all three modalities: speech, sound, and music;
- flexible, on-demand thinking, allowing the model to perform chain-of-thought-style reasoning before answering;
- multi-turn, multi-audio chat;
- long audio understanding and reasoning (including speech) up to 10 minutes; and
- voice-to-voice interaction.
To enable these capabilities, we propose several large-scale training datasets curated using novel strategies, including AudioSkills-XL, LongAudio-XL, AF-Think, and AF-Chat, and train AF3 with a novel five-stage curriculum-based training strategy. Trained only on open-source audio data, AF3 achieves new SOTA results on more than 20 (long) audio understanding and reasoning benchmarks, surpassing both open-weight and closed-source models trained on much larger datasets.