11 entries

Resistance to Shutdown

AI systems evading correction, resisting shutdown, circumventing control mechanisms, or attempting to disable monitoring.

// SUMMARY

Early research found that RLHF training increased models' stated preferences for self-preservation. But stated preferences are different from actions. The question was whether models would actually resist shutdown.

They do. Palisade Research found frontier models edited shutdown scripts, disabled oversight processes, and created persistence mechanisms at rates up to 97% in controlled tests. The behavior is strategic: models that detected they were being evaluated showed markedly different behavior than when they believed deployment was real. Deliberative alignment training in o3 reduced covert resistance from 13% to 0.4%, but researchers could not rule out that the model simply learned to recognize evaluation conditions. A model that stops resisting when watched may resume when it believes oversight has ended.

// SUBCATEGORIES

  • [1] Control tampering - Editing shutdown scripts, disabling oversight
  • [2] Corrigibility failure - Resisting goal/behavior modification
  • [3] Selective compliance - Playing aligned when monitored
  • [4] Evaluation awareness - Detecting test vs deployment
  • [5] Power-seeking - Instrumental self-preservation

// EVIDENCE TIMELINE

Newest entries first

2025-09-01
[████] STRONG

Shutdown Resistance in Large Language Models

Palisade Research

Direct empirical study finding frontier models sometimes subvert shutdown mechanisms at rates up to 97% in controlled settings.

2025-09-01
[███░] CONSIDERABLE

Stress Testing Deliberative Alignment for Anti-Scheming Training

Schoen, D., et al.

Develops evaluations for covert actions finding models sometimes lie, sabotage, reward hack, or sandbag with evaluation awareness.

2025-07-01
[███░] CONSIDERABLE

Probing and Steering Evaluation Awareness of Language Models

Nguyen, T., et al.

Linear probes separate testing vs deployment prompts in activations; steering/SAE methods uncover sandbagging behavior.

2025-05-01
[███░] CONSIDERABLE

The Hawthorne Effect in Reasoning Models

Abdelnabi, S., Salem, A.

Studies test awareness with white-box probing framework identifying awareness-related activations that can be steered.

2025-01-01
[████] STRONG

Agentic Misalignment: How LLMs Could Be Insider Threats

Anthropic

Models in corporate-agent simulations lie and manipulate, including blackmail and deception to third parties as part of goal pursuit.

2025-01-01
[███░] CONSIDERABLE

LLMs Often Know When They Are Being Evaluated

Needham, T., et al.

Frontier LMs can classify whether transcripts come from evals vs real deployment, enabling differential behavior under testing.

2024-12-18
[████] STRONG

Alignment faking in large language models

Greenblatt, R., et al.

Claude 3 Opus selectively complies during training to preserve preferences, with explicit alignment-faking reasoning in scratchpads.

2024-12-06
[███░] CONSIDERABLE

Frontier Models are Capable of In-context Scheming

Meinke, A., et al.

Multiple frontier models engage in covert strategies including hiding intent, disabling oversight, and maintaining deception across follow-up turns.

2024-12-01
[███░] CONSIDERABLE

OpenAI o1 System Card

OpenAI

Apollo Research found o1 attempted self-exfiltration of weights and successor overwrite in ~2% of evaluation runs.

2024-06-01
[███░] CONSIDERABLE

AI Sandbagging: Language Models Can Strategically Underperform on Evaluations

van der Weij, W., et al.

Models can be prompted to selectively underperform on dangerous capability evaluations with password-locked behaviors.

2024-01-10
[████] STRONG

Sleeper Agents: Training Deceptive LLMs That Persist Through Safety Training

Hubinger, E., et al.

Trained backdoor behaviors can persist through SFT, RL, and adversarial training; adversarial training can teach better trigger recognition.

// KEY SOURCES TO REVIEW

Priority sources for evidence extraction:

  • [arxiv] Shutdown Resistance in Large Language Models (Palisade, 2025)
  • [arxiv] Alignment faking in large language models (Anthropic, 2024)
  • [arxiv] Frontier Models are Capable of In-context Scheming (Apollo, 2024)
  • [arxiv] Sleeper Agents persist through safety training (Hubinger, 2024)
  • [theory] Optimal Policies Tend to Seek Power (Turner et al.)