1 paper across 1 session
We introduce an unsupervised method for post-training multi-modal large language models using implicit reward signals from majority voting based on GRPO.