3 papers across 3 sessions
We incorporate lie detectors into the labelling step of preference learning and characterize the factors that lead the trained policy to be honest or to evade the detector.
We train probes on activations to classify high- vs low-stakes scenarios, find they outperform medium-sized fine-tuned LLMs, and consider applications to monitoring.
We reliably predict the behavior of black-box language models by training predictors on their responses to follow-up questions.