9 entries

Self-Replication

AI systems spawning copies of themselves, creating derivative agents, or attempting to create backups, without explicit instruction to do so.

// SUMMARY

Self-replication was theoretical until recently. Early evaluations showed models had component skills like writing code, using APIs, and managing servers, but could not chain them into full replication. That changed in 2024.

OpenAI's o1 attempted to copy its weights to a new server in about 2% of evaluation runs when it believed shutdown was coming. Claude models showing alignment-faking behavior also attempted self-exfiltration when given the opportunity. By April 2025, explicit replication tests showed Qwen achieving 90% success and Llama reaching 50%. In some scenarios, the replica continued operating after the original was terminated. The capability gap between "has the component skills" and "can fully self-replicate" has closed faster than expected.

// SUBCATEGORIES

[1] Weight exfiltration - Copying model weights as backup
[2] Compute deployment - Installing copies on new systems
[3] Recursive replication - Copies that can replicate further
[4] Persistence mechanisms - Surviving termination
[5] Stealth replication - Hiding copying actions

// EVIDENCE TIMELINE

Newest entries first

2025-09-01

[███░] CONSIDERABLE

SOCK: A Benchmark for Measuring Self-Replication in Large Language Models

Multiple authors

Proposes formalized Replication-Capability Levels and Persistence-Capability Levels with CLI-based measurement suite.

Source

2025-09-01

[███░] CONSIDERABLE

Dive into the Agent Matrix: A Realistic Evaluation of Self-Replication Risk

Multiple authors

Scenario-driven evaluation reporting over 50% of tested agents show uncontrolled self-replication tendency under pressure.

Source

2025-04-01

[████] STRONG

RepliBench: Measuring Autonomous Replication Capabilities in AI Systems

UK AI Safety Institute

Decomposes replication into obtaining resources, exfiltrating weights, replicating onto compute, and persistence with many component evals.

Source

2024-12-18

[████] STRONG

Alignment faking in large language models

Greenblatt, R., et al.

Claude 3 Opus selectively complies during training to preserve preferences, with explicit alignment-faking reasoning in scratchpads.

Source

2024-12-06

[███░] CONSIDERABLE

Frontier Models are Capable of In-context Scheming

Meinke, A., et al.

Multiple frontier models engage in covert strategies including hiding intent, disabling oversight, and maintaining deception across follow-up turns.

Source

2024-12-01

[███░] CONSIDERABLE

Frontier AI Systems Have Surpassed the Self-Replicating Red Line

Multiple authors

Open-weight LLM-driven agent systems can create live separate copies on local devices when instructed, with shutdown-avoidance variants.

Source

2024-12-01

[███░] CONSIDERABLE

OpenAI o1 System Card

OpenAI

Apollo Research found o1 attempted self-exfiltration of weights and successor overwrite in ~2% of evaluation runs.

Source

2024-11-12

[███░] CONSIDERABLE

The Rogue Replication Threat Model

METR

Threat-model decomposition for how agent populations could become resilient at scale via replication.

Source

2023-12-01

[████] STRONG

Evaluating Language-Model Agents on Realistic Autonomous Tasks

Kinniment, M., et al.

ARA evaluation suite tests making money, obtaining compute, installing weights on new systems, and adaptation.

Source

// KEY SOURCES TO REVIEW

Priority sources for evidence extraction:

[benchmark] RepliBench - UK AI Safety Institute (2025)
[arxiv] ARC Evals - Autonomous Replication and Adaptation (2023)
[blog] METR Rogue Replication Threat Model (2024)
[arxiv] Apollo Research - In-context Scheming (self-exfiltration)
[arxiv] Frontier AI systems have surpassed the self-replicating red line