3 papers across 2 sessions
We train probes on activations to classify high- vs low-stakes scenarios, find they outperform medium-sized fine-tuned LLMs, and consider applications to monitoring.
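The probing setup above can be sketched as a linear classifier over activation vectors. This is a minimal illustration on synthetic data, not the paper's pipeline: the random vectors stand in for model activations, and the class-dependent shift along an assumed "stakes" direction stands in for whatever signal the real probes pick up.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for residual-stream activations extracted from an LLM;
# labels mark high-stakes (1) vs low-stakes (0) scenarios.
rng = np.random.default_rng(0)
d, n = 64, 400                            # activation dim, scenarios per class
direction = rng.normal(size=d)            # assumed "stakes" direction
low = rng.normal(size=(n, d))
high = rng.normal(size=(n, d)) + 0.5 * direction

X = np.vstack([low, high])
y = np.array([0] * n + [1] * n)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# The probe itself: a logistic-regression classifier on raw activations.
probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
acc = probe.score(X_te, y_te)
print(f"probe accuracy: {acc:.2f}")
```

In a monitoring setting, the trained probe would be applied to activations from live model calls; a linear probe is cheap enough to run on every forward pass, which is part of why probes can be attractive relative to a separate fine-tuned classifier LLM.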
A dataset of millions of diverse synthetic stories, which enables training better small language models.