Beyond Safe Data: Pretraining-Stage Alignment with Regular Safety Reflection
Quick Answer
The study introduces Safety Reflection Pretraining, a pretraining-stage alignment method that regularly inserts short safety reflections into pretraining corpora so that self-monitoring is learned as part of language modeling.
Quick Take
In experiments with 1.7B models pretrained on FineWeb-Edu, Safety Reflection Pretraining improves safety classification accuracy and substantially reduces the success rates of inference-stage and finetuning attacks. In ablations in the synthetic MedSafetyWorld environment, it shows a clear advantage over data filtering and rewriting at preventing models from acting on unsafe behaviors generalized from safe data.
Key Points
- Safety Reflection Pretraining integrates short safety reflections into pretraining corpora.
- 1.7B models pretrained on FineWeb-Edu showed improved safety classification accuracy.
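- The method substantially reduced the success rates of inference-stage and finetuning attacks.
- In a fully controlled synthetic environment, MedSafetyWorld, the method outperformed data filtering and rewriting at preventing models from acting on unsafe behaviors generalized from safe data.
- Findings suggest pretraining alignment should not only make the training data safe, but also shape the behaviors models are likely to acquire from safe data.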
Article Content
From source RSS / original summary
arXiv:2606.19168v1 Announce Type: new
Abstract: To achieve deeper safety alignment for large language models (LLMs), recent efforts have studied how to push safety interventions earlier into the pretraining stage, primarily by filtering unsafe data or rewriting it into safer forms. We argue that pretraining-stage alignment should go beyond making the data safe: LLMs may compose seemingly benign knowledge and capabilities into unsafe behaviors.
To this end, we propose Safety Reflection Pretraining, a pretraining-stage alignment method which regularly inserts short safety reflections into pretraining corpora to integrate self-monitoring directly into language modeling, establishing a foundational capability that is subsequently reinforced by compatible post-training. Our experiments with 1.7B models pretrained on FineWeb-Edu show that Safety Reflection Pretraining improves safety classification accuracy and substantially reduces the success rates of inference-stage and finetuning attacks.
Complementary to our real-world experiments, we also introduce a fully controlled synthetic environment, MedSafetyWorld, with a clear definition of safety and a reasoning structure under which models can easily generalize unsafe behaviors from safe data. Ablations in MedSafetyWorld further demonstrate a clear advantage of Safety Reflection Pretraining in preventing models from acting on unsafe behaviors generalized from safe data, compared with data filtering and rewriting. Taken together, our findings suggest that pretraining alignment should not only make the training data safe, but also shape the behaviors that models are likely to acquire from safe data.
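The abstract describes the method only at a high level: short safety reflections are inserted at regular points in pretraining text. As a rough illustration of that idea, and not the authors' implementation, the Python sketch below interleaves a fixed reflection string into a list of paragraphs. The function name `insert_reflections`, the `every` parameter, the paragraph-level granularity, and the placeholder reflection text are all assumptions made for this example; the summary does not state how reflections are worded, generated, or spaced.

```python
from typing import List


def insert_reflections(paragraphs: List[str], reflection: str, every: int) -> List[str]:
    """Return a copy of `paragraphs` with `reflection` added after every `every` paragraphs.

    Hypothetical helper: the insertion unit (paragraphs) and the fixed
    interval are assumptions, not details taken from the paper.
    """
    if every < 1:
        raise ValueError("every must be at least 1")
    out: List[str] = []
    for i, para in enumerate(paragraphs, start=1):
        out.append(para)
        # Insert a reflection once each group of `every` paragraphs is complete.
        if i % every == 0:
            out.append(reflection)
    return out


# Placeholder corpus and reflection text, invented for illustration only.
doc = ["Paragraph one.", "Paragraph two.", "Paragraph three.",
       "Paragraph four.", "Paragraph five."]
REFLECTION = "[reflection] Hypothetical placeholder text for a safety self-check."

augmented = insert_reflections(doc, REFLECTION, every=2)
print("\n".join(augmented))
```

With `every=2`, the reflection lands after the second and fourth paragraphs, and the trailing fifth paragraph is left without one. A real pipeline would presumably vary the reflection content with the surrounding context rather than repeat one fixed string.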

