2 papers across 2 sessions
We detect and remove backdoor samples in MLLM fine-tuning by identifying abnormal attention entropy patterns without requiring clean data or model modifications.
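The exact detection procedure is not spelled out above; as an illustration only, here is a minimal sketch of the general idea, assuming the signal is the Shannon entropy of a sample's attention weights and that backdoored samples are flagged as robust outliers against the batch. The function names `attention_entropy` and `flag_outliers` are hypothetical, not from the paper.

```python
import numpy as np

def attention_entropy(attn_weights):
    """Mean Shannon entropy over the rows of an attention matrix.

    Hypothetical helper: assumes each row is a probability
    distribution over key positions (rows sum to 1).
    """
    eps = 1e-12
    p = np.clip(np.asarray(attn_weights, dtype=float), eps, 1.0)
    return float(np.mean(-np.sum(p * np.log(p), axis=-1)))

def flag_outliers(entropies, z_thresh=3.0):
    """Flag samples whose entropy deviates from the batch median by
    more than z_thresh robust (MAD-based) z-scores.

    Hypothetical thresholding rule; the paper's actual criterion
    may differ.
    """
    e = np.asarray(entropies, dtype=float)
    med = np.median(e)
    mad = np.median(np.abs(e - med)) + 1e-12
    z = 0.6745 * (e - med) / mad
    return np.abs(z) > z_thresh

# Toy demo: uniform attention has maximal entropy (ln 8 per row),
# sharply peaked attention has near-zero entropy.
uniform = np.full((4, 8), 1 / 8)
peaked = np.eye(4, 8)
entropies = [attention_entropy(uniform)] * 3 + [attention_entropy(peaked)]
print(flag_outliers(entropies))
```

The appeal of this style of filter is that it needs no clean reference set: the statistic is computed per sample and compared against the batch itself.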
We introduce the Martingale Score, an unsupervised metric from Bayesian statistics, to show that reasoning in LLMs often leads to belief entrenchment rather than truth-seeking, and we show that this score predicts ground-truth accuracy.
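The score's exact definition is not given above; as an illustration only, the following sketch assumes it quantifies deviation from the martingale property E[p_{t+1} | p_t] = p_t, where p_t is the model's stated belief in its initial answer at reasoning step t. Under unbiased truth-seeking the signed updates should average to zero, whereas systematic positive drift toward the initial answer indicates entrenchment. The function `martingale_drift` is a hypothetical proxy, not the paper's metric.

```python
import numpy as np

def martingale_drift(belief_traces):
    """Average signed per-step belief update across traces.

    Hypothetical proxy: under the martingale property the expected
    update is zero; a positive value means beliefs drift toward the
    initial answer over the course of reasoning (entrenchment).
    """
    per_trace = [np.mean(np.diff(np.asarray(t, dtype=float)))
                 for t in belief_traces]
    return float(np.mean(per_trace))

# Toy demo: one monotonically entrenching trace, one flat trace.
traces = [
    [0.6, 0.7, 0.8, 0.9],  # belief rises +0.1 per step
    [0.5, 0.5, 0.5, 0.5],  # belief never updates
]
print(martingale_drift(traces))
```

Mentally: the first trace contributes a mean update of 0.1, the second 0.0, so the pooled drift is 0.05, i.e. a net pull toward the initial answer.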