Center for AI Safety - NeurIPS 2025

🏛 Center for AI Safety

2 papers across 2 sessions

Poster Session 3

Thursday, December 4, 2025 · 11:00 AM → 2:00 PM

Utility Engineering: Analyzing and Controlling Emergent Value Systems in AIs

#1412 Spotlight · Mantas Mazeika, Xuwang Yin, Rishub Tamirisa, Jaehyuk Lim, Bruce W. Lee, Richard Ren, Long Phan, Norman Mu, Oliver Zhang, Dan Hendrycks

We discover that coherent value systems emerge with scale in LLMs and propose the research avenue of utility engineering to analyze and control these emergent value systems.

Poster Session 4

1 paper

Thursday, December 4, 2025 · 4:30 PM → 7:30 PM

Exhibit Hall C,D,E

Safety Pretraining: Toward the Next Generation of Safe AI

#5210 · Pratyush Maini, Sachin Goyal, Dylan Sam, Alexander Robey, Yash Savani, Yiding Jiang, Andy Zou, Matt Fredrikson, Zachary Lipton, Zico Kolter

We present a data-centric pretraining framework that builds safety into the model from the start.