1 paper across 1 session
A unified multimodal model based purely on discrete flow matching, achieving performance comparable to autoregressive (AR) MLLMs