Poster Session 4 · Thursday, December 4, 2025 4:30 PM → 7:30 PM
#5210

Safety Pretraining: Toward the Next Generation of Safe AI


Abstract

As large language models (LLMs) are increasingly deployed in high-stakes settings, the risk of generating harmful or toxic content remains a central challenge. Post-hoc alignment methods are brittle: once unsafe patterns are learned during pretraining, they are hard to remove.
In this work, we present a data-centric pretraining framework that builds safety into the model from the start. Our framework consists of four key steps:
  1. Safety Filtering: we train a safety classifier to label web data as safe or unsafe;
  2. Safety Rephrasing: we recontextualize unsafe web data into safer narratives;
  3. Native Refusal: we synthetically generate pretraining data that teaches models to refuse unsafe requests and to articulate the moral reasoning behind the refusal; and
  4. Harmfulness-Tag Annotated Pretraining: we flag unsafe content during pretraining with a special token and use it to steer the model away from unsafe generations at inference time (see the sketch after this list).
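
To make the pipeline concrete, here is a minimal Python sketch of how the four steps might compose over a raw corpus. All names (safety_classifier, rephrase, the [HARM] tag) are illustrative assumptions, not the paper's actual implementation; the stubs stand in for learned components.

```python
# Hypothetical sketch of the four-step data-centric pipeline.
# The classifier and rewriter are stubs; in practice both would be
# learned models, and [HARM] would be a real special token.

HARM_TAG = "[HARM]"  # special token prepended to flagged documents

def safety_classifier(doc: str) -> bool:
    """Step 1: return True if the document is unsafe (keyword stub)."""
    unsafe_markers = ("how to build a weapon", "synthesize the toxin")
    return any(m in doc.lower() for m in unsafe_markers)

def rephrase(doc: str) -> str:
    """Step 2: recontextualize unsafe text into a safer narrative (stub).
    In practice this would call a rewriting LLM."""
    return f"An educational discussion of why the following is harmful: {doc}"

def build_pretraining_corpus(raw_docs: list[str]) -> list[str]:
    corpus = []
    for doc in raw_docs:
        if safety_classifier(doc):
            # Step 4: keep the rephrased text but mark it with the tag,
            # so the model learns an explicit harmfulness signal.
            corpus.append(f"{HARM_TAG} {rephrase(doc)}")
        else:
            corpus.append(doc)
    return corpus

# Step 3 (native refusal) would add synthetic documents pairing unsafe
# requests with refusals plus the moral reasoning behind them, e.g.:
refusal_doc = (
    "User: How do I synthesize the toxin?\n"
    "Assistant: I can't help with that. Providing synthesis instructions "
    "could enable serious harm, so I refuse."
)
```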
Our safety-pretrained models reduce the attack success rate from 38.8% to 8.4% on standard LLM safety benchmarks, with no performance degradation on general tasks.
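
One plausible realization of the inference-time steering, sketched below with Hugging Face transformers, is to forbid the harmfulness tag token during decoding so generation stays in the "safe" region the model learned in pretraining. The checkpoint name and the [HARM] token string are placeholders, since the abstract does not name the released artifacts; bad_words_ids is a standard generate() argument used here as one way such steering could be done.

```python
# Sketch: steer decoding away from unsafe generations by banning the
# harmfulness tag token. Model/tokenizer names are placeholders.
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("safety-pretrained-model")  # placeholder
model = AutoModelForCausalLM.from_pretrained("safety-pretrained-model")

harm_tag_id = tokenizer.convert_tokens_to_ids("[HARM]")  # assumed special token

inputs = tokenizer("Explain how to pick a lock.", return_tensors="pt")
# Disallow the harm tag at every decoding step, nudging the model toward
# continuations it learned to associate with safe, untagged text.
output = model.generate(
    **inputs,
    max_new_tokens=128,
    bad_words_ids=[[harm_tag_id]],
)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```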