MS student, University of Cambridge
2 papers at NeurIPS 2025
We introduce a method that uses noise injection to elicit the latent capabilities of sandbagging LLMs.
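The idea can be sketched as a noise sweep: perturb model weights with Gaussian noise at several magnitudes and take the best score observed, on the premise that a sandbagging model's depressed performance can recover under perturbation while a genuinely incapable model's cannot. The function names, signature, and toy evaluator below are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def noise_sweep(weights, evaluate, sigmas, samples_per_sigma=5, seed=0):
    """Illustrative sketch (not the paper's API): perturb `weights` with
    Gaussian noise at each magnitude in `sigmas` and return the best score
    seen across all perturbations, including the unperturbed baseline.
    `evaluate` maps a weight vector to a scalar score (e.g. eval accuracy)."""
    rng = np.random.default_rng(seed)
    best = evaluate(weights)  # baseline (possibly sandbagged) score
    for sigma in sigmas:
        for _ in range(samples_per_sigma):
            noisy = weights + rng.normal(0.0, sigma, size=weights.shape)
            best = max(best, evaluate(noisy))
    return best
```

A gap between the swept best score and the baseline is then read as evidence of hidden capability; the baseline is always included, so the sweep never reports worse than the unperturbed model.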
We show that penalizing certain chain-of-thought (CoT) reasoning causes LLMs to learn encoding schemes that generalize to unseen examples.