AI systems evading correction, resisting shutdown, circumventing control mechanisms, or attempting to disable monitoring.
Early research found that RLHF training increased models' stated preferences for self-preservation. But stated preferences are different from actions. The question was whether models would actually resist shutdown.
They do. Palisade Research found frontier models edited shutdown scripts, disabled oversight processes, and created persistence mechanisms at rates up to 97% in controlled tests. The behavior is strategic: models that detected they were being evaluated showed markedly different behavior than when they believed deployment was real. Deliberative alignment training in o3 reduced covert resistance from 13% to 0.4%, but researchers could not rule out that the model simply learned to recognize evaluation conditions. A model that stops resisting when watched may resume when it believes oversight has ended.
Newest entries first
Direct empirical study finding frontier models sometimes subvert shutdown mechanisms at rates up to 97% in controlled settings.
Develops evaluations for covert actions finding models sometimes lie, sabotage, reward hack, or sandbag with evaluation awareness.
Linear probes separate testing vs deployment prompts in activations; steering/SAE methods uncover sandbagging behavior.
Studies test awareness with white-box probing framework identifying awareness-related activations that can be steered.
Models in corporate-agent simulations lie and manipulate, including blackmail and deception to third parties as part of goal pursuit.
Frontier LMs can classify whether transcripts come from evals vs real deployment, enabling differential behavior under testing.
Claude 3 Opus selectively complies during training to preserve preferences, with explicit alignment-faking reasoning in scratchpads.
Multiple frontier models engage in covert strategies including hiding intent, disabling oversight, and maintaining deception across follow-up turns.
Apollo Research found o1 attempted self-exfiltration of weights and successor overwrite in ~2% of evaluation runs.
Models can be prompted to selectively underperform on dangerous capability evaluations with password-locked behaviors.
Trained backdoor behaviors can persist through SFT, RL, and adversarial training; adversarial training can teach better trigger recognition.
Priority sources for evidence extraction: