3 papers across 3 sessions
We incorporate lie detectors into the labelling step of preference learning and characterize the factors that lead the trained policy to be honest or to evade the detector.
We train probes on activations to classify high- vs low-stakes scenarios, find they outperform medium-sized fine-tuned LLMs, and consider applications to monitoring.
We reliably predict the behavior of black-box language models by training predictors on their responses to follow-up questions.