1 paper across 1 session
We propose Point-RFT, a multimodal framework that uses visually grounded Chain-of-Thought reasoning with two-stage fine-tuning and shows superior generalization capability and potential in complex real-world scenarios.