2 papers across 2 sessions
Optimally constructing monitoring protocols with multiple monitors of varying costs and performances.
We show that penalizing certain CoT reasoning makes LLMs learn encoding schemes that generalize to unseen examples.