PhD student, University of Hong Kong
2 papers at NeurIPS 2025
We propose SRPO, a reflection-aware reinforcement learning method that explicitly teaches multimodal LLMs to self-reflect, significantly improving their reasoning and outperforming state-of-the-art models on multiple benchmarks.
We propose SAS, which simulates a larger number of attention heads and a larger hidden size per head for better performance while keeping the original model size unchanged.