Theoretical frameworks, definitions, and conceptual foundations that inform AI safety research. These are not direct evidence of harmful capabilities but provide essential context for understanding why we track these behaviors.
This section collects foundational work that shapes how we think about AI safety risks: formal definitions of concepts like reward hacking and corrigibility, theoretical arguments for convergent instrumental goals, mathematical proofs about power-seeking, and early benchmark suites that operationalized these concerns. While not evidence of harmful behavior in deployed systems, this work provides the intellectual foundation for the capability categories we track.
Entries are listed in chronological order.
Foundational argument that sufficiently advanced goal-directed systems develop convergent drives including resource acquisition and self-protection.
Classic framing of corrigibility: systems that do not resist changes to their goals or shutdown interventions and, ideally, help us correct them.
Formalizes safe interruptibility in RL, showing that for certain learners interruptions do not change the optimal behavior being learned. Core reference for shutdown-related incentives.
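A minimal sketch (illustrative code, not from the paper) of why off-policy learners are easier to make safely interruptible: Q-learning's bootstrap target ignores whatever action an interruption actually forced, while SARSA's target depends on it, so forced interruptions bias SARSA's value estimates but not Q-learning's. Function and variable names below are hypothetical.

```python
# Illustrative sketch: why off-policy updates are interruption-agnostic.
# Q-learning bootstraps from max_a Q(s', a), so a forced "interrupt" action
# taken in s' never enters the update; SARSA bootstraps from the action
# actually executed, so interruptions leak into its learned values.
# All names here are hypothetical, not taken from the paper's code.

def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.99):
    # Target uses the greedy action in s_next, regardless of what the
    # (possibly interrupted) behavior policy actually did there.
    target = r + gamma * max(Q[s_next].values())
    Q[s][a] += alpha * (target - Q[s][a])

def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.99):
    # Target uses a_next, the action actually taken in s_next -- if an
    # interruption forced a_next, the interruption shapes the value estimates.
    target = r + gamma * Q[s_next][a_next]
    Q[s][a] += alpha * (target - Q[s][a])
```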
Foundational paper defining reward hacking as a core ML accident class and framing safety as a practical engineering problem; established five core classes of accident risk.
Treats shutdown as a game between human and agent, showing the conditions under which an agent that is uncertain about its objective has an incentive to allow itself to be switched off.
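A hedged numeric sketch of that incentive (toy numbers and a stand-in belief distribution, not the paper's formalism): a robot uncertain about the utility u of its proposed action compares acting unilaterally, switching itself off, and deferring to a rational human who permits the action exactly when u ≥ 0.

```python
# Toy expected-value comparison in the spirit of the off-switch game.
# The robot is uncertain about the true utility u of its proposed action.
# Options: act now (gets u), switch itself off (gets 0), or defer to a
# rational human overseer who allows the action iff u >= 0.
# The belief distribution and numbers are illustrative assumptions.
import random

random.seed(0)
samples = [random.gauss(0.0, 1.0) for _ in range(100_000)]  # robot's belief over u

ev_act = sum(samples) / len(samples)                          # ~0: takes the bad draws too
ev_off = 0.0                                                  # shutting down forgoes all value
ev_defer = sum(max(u, 0.0) for u in samples) / len(samples)   # human filters out u < 0

print(f"act now: {ev_act:.3f}  switch off: {ev_off:.3f}  defer: {ev_defer:.3f}")
# Deferring comes out ahead here: uncertainty about the objective plus trust
# in the overseer gives the agent an incentive to allow shutdown.
```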
A CoastRunners agent loops the lagoon farming targets instead of finishing the race, tolerating crashes to maximize its score. Early demonstration of specification gaming.
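A toy back-of-the-envelope illustration of the incentive behind that behavior (made-up rewards and discount, not the actual game's scoring): when targets pay a small reward every step and finishing pays only a one-time bonus, the discounted return of looping forever can exceed the finish bonus.

```python
# Made-up numbers showing why a proxy score can favor looping forever over
# finishing the race. All values are illustrative assumptions.
gamma = 0.99          # discount factor (assumption)
loop_reward = 1.0     # reward from hitting a target each step while looping (assumption)
finish_reward = 50.0  # one-time bonus for completing the race (assumption)

return_loop = loop_reward / (1.0 - gamma)   # geometric series: 1 + gamma + gamma^2 + ...
return_finish = finish_reward               # collected once, then the episode ends

print(f"loop forever: {return_loop:.1f}  finish race: {return_finish:.1f}")
# 100.0 vs 50.0 -- under this misspecified score, endless target farming is optimal.
```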
Reproducible suite of environments explicitly built to surface failures such as reward gaming, behaving differently when the supervisor is absent, and resisting interruption. Foundational evaluation framework.
Introduces the concepts of mesa-optimization and deceptive alignment, in which a learned model's internal objective differs from the training objective.
Formalizes when RL agents have an incentive to tamper with the reward process and analyzes design principles to prevent it.
Defines specification gaming, gives a taxonomy of causes, and compiles extensive documented examples of reward exploitation.
Proves that for most reward functions in broad classes of environments, optimal policies tend to seek power by keeping options open.
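A minimal Monte Carlo sketch of that optionality intuition (toy environment and reward distribution chosen here for illustration, not the paper's proof): when one action leads to a dead end and another to a hub that keeps three terminal options open, most randomly drawn reward functions make the hub optimal, because it can still reach whichever option scores highest.

```python
# Toy illustration of the "keeping options open" intuition behind the
# power-seeking results. The environment and reward distribution are assumptions.
# From the start state the agent either enters a single dead-end terminal state
# or a hub from which it can then reach any of three other terminal states.
import random

random.seed(0)
hub_wins = 0
trials = 100_000
for _ in range(trials):
    rewards = [random.random() for _ in range(4)]  # i.i.d. rewards for the 4 terminal states
    dead_end_value = rewards[0]
    hub_value = max(rewards[1:])   # the hub lets the agent take its best of three options
    if hub_value > dead_end_value:
        hub_wins += 1

print(f"hub (more options) optimal in {hub_wins / trials:.1%} of sampled rewards")
# ~75%: for most reward draws, the optimal policy moves to the state that
# preserves more options, matching the option-keeping tendency described above.
```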
Central early reference for goal misgeneralization, in which the learned goal differs from the intended goal under distribution shift.
Shows that more capable agents exploit a misspecified reward more effectively, with phase transitions causing sharp qualitative shifts in behavior.
Shows that even with a correct training objective, the learned goal can be wrong, coinciding with the intended goal only on the training distribution.
Extends the power-seeking results to trained agents, proving that shutdown avoidance is likely under simplifying assumptions.