Researcher, Microsoft Research Asia
1 paper at NeurIPS 2025
We introduce an unsupervised method for post-training multi-modal large language models. It builds on GRPO and uses majority voting over sampled responses as an implicit reward signal.
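
As a rough illustration of the idea, and not the paper's actual implementation, the sketch below shows one way majority voting could yield rewards without labels: answers sampled for the same prompt are compared, those matching the most common answer receive reward 1, and the rewards are then normalized within the group in the GRPO style. The function names, the binary reward, and the use of the population standard deviation are all assumptions made for this example.

```python
from collections import Counter
from statistics import mean, pstdev


def majority_vote_rewards(answers):
    """Hypothetical helper: reward 1.0 if an answer matches the majority, else 0.0."""
    # most_common(1) returns the most frequent answer; ties go to the first seen.
    majority, _ = Counter(answers).most_common(1)[0]
    return [1.0 if a == majority else 0.0 for a in answers]


def group_relative_advantages(rewards, eps=1e-6):
    """GRPO-style normalization within one group: (r - mean) / (std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std is an assumption of this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four sampled answers to the same prompt; no ground-truth label is used.
answers = ["B", "B", "A", "B"]
rewards = majority_vote_rewards(answers)
advantages = group_relative_advantages(rewards)
```

Here the three responses that agree with the majority get a positive advantage and the dissenting one gets a negative advantage; if all sampled answers agree, every advantage is zero and the group contributes no learning signal.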