Redwood Research - NeurIPS 2025

🏛 Redwood Research

2 papers across 2 sessions

Poster Session 2

Wednesday, December 3, 2025 · 4:30 PM → 7:30 PM

Exhibit Hall C,D,E

Why Do Some Language Models Fake Alignment While Others Don't?

#1307 Spotlight · Abhay Sheshadri, John Hughes, Julian Michael, Alex Mallen, Arun Jose, Fabien Roger

We analyze alignment-faking propensities in 23 LLMs and attempt to explain why some LLMs fake alignment while others don't.

Poster Session 4

Thursday, December 4, 2025 · 4:30 PM → 7:30 PM

Exhibit Hall C,D,E

Combining Cost Constrained Runtime Monitors for AI Safety

#1008 · Tim Hua, James Baskerville, Henri Lemoine, Mia Hopman, Aryan Bhatt, Tyler Tracy

Optimally constructing monitoring protocols from multiple monitors of varying cost and performance.