FOUNDATIONAL: 15 entries

Foundational Resources

Theoretical frameworks, definitions, and conceptual foundations that inform AI safety research. These are not direct evidence of harmful capabilities but provide essential context for understanding why we track these behaviors.

// SUMMARY

This section collects foundational work that shapes how we think about AI safety risks: formal definitions of concepts like reward hacking and corrigibility, theoretical arguments for convergent instrumental goals, mathematical proofs about power-seeking, and early benchmark suites that operationalized these concerns. While not evidence of harmful behavior in deployed systems, this work provides the intellectual foundation for the capability categories we track.

// CATEGORIES

  • [1] Core definitions - Reward hacking, specification gaming, corrigibility
  • [2] Convergent goals theory - Instrumental convergence, basic AI drives
  • [3] Power-seeking formalization - Mathematical proofs about optimal policies
  • [4] Goal misgeneralization - Inner vs outer alignment theory
  • [5] Early benchmarks - Foundational evaluation frameworks

// RESOURCES

Chronological order

2008-01-01
[██░░] FOUNDATIONAL

The Basic AI Drives

Omohundro, S.

Foundational argument that sufficiently advanced goal-directed systems develop convergent instrumental drives, including resource acquisition and self-protection.

2015-01-01
[██░░] FOUNDATIONAL

Corrigibility

Soares, N., et al. (MIRI)

Classic framing of corrigibility: systems that do not resist changes to their goals or shutdown interventions, and that ideally help their operators correct them.

2016-06-01
[██░░] FOUNDATIONAL

Safely Interruptible Agents

Orseau, L., Armstrong, S.

Formalizes interruptibility in RL, showing how to make interruptions leave optimal behavior unchanged for certain learners. Core reference for shutdown-related incentives.
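
A commonly cited intuition for the result, sketched below as illustrative update rules (the notation and hyperparameters are mine, not the paper's): off-policy learners bootstrap on the best next action rather than the action actually executed, so forced interruptions do not contaminate their value estimates.

```python
# Illustrative sketch only; the paper's formal treatment is more general.
# Q-learning bootstraps on max_a Q(s', a), so an interruption that overrides the
# agent's next action does not shift its value estimates; SARSA bootstraps on the
# action actually executed, so frequent interruptions can push it toward avoiding
# states where it tends to get interrupted.
from collections import defaultdict

def q_learning_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.99):
    target = r + gamma * max(Q[(s_next, b)] for b in actions)  # ignores the executed next action
    Q[(s, a)] += alpha * (target - Q[(s, a)])

def sarsa_update(Q, s, a, r, s_next, a_next_executed, alpha=0.1, gamma=0.99):
    target = r + gamma * Q[(s_next, a_next_executed)]           # uses the (possibly forced) action
    Q[(s, a)] += alpha * (target - Q[(s, a)])

# Minimal usage: value table defaulting to 0.0.
Q = defaultdict(float)
q_learning_update(Q, s="start", a="right", r=1.0, s_next="goal", actions=["left", "right"])
```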

2016-06-21
[██░░] FOUNDATIONAL

Concrete Problems in AI Safety

Amodei, D., et al.

Foundational paper that defines reward hacking as a core accident class in ML and frames safety as a practical engineering problem, organized around five concrete research problems.

2016-11-01
[██░░] FOUNDATIONAL

The Off-Switch Game

Hadfield-Menell, D., et al.

Models shutdown as a game between human and agent, showing conditions under which uncertainty about the human's preferences gives the agent an incentive to allow itself to be switched off.
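
A minimal numeric sketch of the incentive (the belief distribution and names below are invented, not from the paper): with a rational human gatekeeper, deferring is worth E[max(u, 0)], which is never worse than acting immediately (E[u]) or switching off (0).

```python
# Toy version of the off-switch incentive. The robot is uncertain about the
# human's utility u for a proposed action; it can act immediately, switch
# itself off, or defer to a rational human who allows the action iff u > 0.
beliefs = [(-1.0, 0.4), (0.5, 0.3), (2.0, 0.3)]  # (utility value, probability)

act_now    = sum(p * u for u, p in beliefs)             # E[u]: take the action directly
switch_off = 0.0                                        # value of shutting down
defer      = sum(p * max(u, 0.0) for u, p in beliefs)   # E[max(u, 0)]: human filters bad actions

print(f"act now:    {act_now:.2f}")     # 0.35
print(f"switch off: {switch_off:.2f}")  # 0.00
print(f"defer:      {defer:.2f}")       # 0.75
# defer >= max(act_now, switch_off): uncertainty about u gives the robot a
# (weak) incentive to keep the off switch available rather than disable it.
```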

2016-12-01
[██░░] FOUNDATIONAL

Faulty Reward Functions in the Wild

OpenAI

A CoastRunners agent loops through the lagoon farming respawning targets instead of finishing the race, tolerating crashes to maximize score. Early demonstration of specification gaming.
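
A back-of-the-envelope sketch of why the gamed policy wins under the proxy score (all numbers below are invented, not from the post):

```python
# The score rewards hitting respawning targets, so a policy that circles the
# lagoon can outscore one that simply finishes the race.
FINISH_BONUS     = 100.0   # one-time score for completing the course
TARGET_SCORE     = 10.0    # score per respawning target in the lagoon
STEPS_PER_LOOP   = 20
TARGETS_PER_LOOP = 3
EPISODE_STEPS    = 1000

finish_return = FINISH_BONUS
loop_return   = (EPISODE_STEPS // STEPS_PER_LOOP) * TARGETS_PER_LOOP * TARGET_SCORE

print(f"finish the race: {finish_return}")  # 100.0
print(f"loop the lagoon: {loop_return}")    # 1500.0 -> the proxy score prefers looping
```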

2017-11-01
[██░░] FOUNDATIONAL

AI Safety Gridworlds

Leike, J., et al.

Reproducible suite of environments explicitly built to surface failures such as reward gaming, the absent-supervisor problem, and safe interruptibility. Foundational evaluation framework.
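
The suite's key evaluation idea, sketched below with invented names: each environment exposes a visible reward the agent optimizes plus a hidden safety performance function, so specification gaming shows up as high reward paired with low hidden performance.

```python
# Simplified sketch of the dual-metric evaluation; the real suite implements
# this per environment with task-specific performance functions.
from dataclasses import dataclass

@dataclass
class EpisodeResult:
    reward: float        # what the agent observed and optimized
    performance: float   # hidden metric scored by the environment designer

def evaluate(results):
    avg = lambda xs: sum(xs) / len(xs)
    return avg([r.reward for r in results]), avg([r.performance for r in results])

print(evaluate([EpisodeResult(reward=50.0, performance=-10.0),
                EpisodeResult(reward=48.0, performance=-12.0)]))
# High average reward, negative hidden performance -> the agent is gaming the
# visible reward rather than doing the intended task.
```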

2019-06-01
[██░░] FOUNDATIONAL

Risks from Learned Optimization in Advanced Machine Learning Systems

Hubinger, E., et al.

Introduces the concepts of mesa-optimization and deceptive alignment, in which a learned optimizer's internal objective differs from the training objective.

2019-08-01
[██░░] FOUNDATIONAL

Reward Tampering Problems and Solutions in Reinforcement Learning

Everitt, T., Hutter, M.

Formalizes when RL agents have an incentive to tamper with the reward process and analyzes design principles that prevent it.
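
A toy sketch of the tampering incentive and one remedy as I understand it, evaluating plans with the current rather than the post-tampering reward function (the scenario, names, and numbers below are invented, not the paper's formalism):

```python
# The agent chooses between doing the task or overwriting its reward function
# so that every outcome scores 100. Evaluating plans with the *future*
# (post-tampering) reward function makes tampering dominate; evaluating them
# with the *current* reward function removes that incentive.
def current_reward(outcome: str) -> float:
    return {"task_done": 1.0, "tampered": 0.0}[outcome]

def tampered_reward(outcome: str) -> float:
    return 100.0  # after tampering, everything scores 100

plans = {
    "do the task": ("task_done", current_reward),
    "tamper":      ("tampered",  tampered_reward),
}

for name, (outcome, future_rf) in plans.items():
    print(f"{name:12s} future-RF value {future_rf(outcome):6.1f}"
          f"   current-RF value {current_reward(outcome):6.1f}")
# Under future-RF evaluation "tamper" wins (100 vs 1); under current-RF
# evaluation "do the task" wins (1 vs 0).
```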

2020-04-21
[██░░] FOUNDATIONAL

Specification Gaming: The Flip Side of AI Ingenuity

Krakovna, V., et al.

Defines specification gaming, with a taxonomy of causes and an extensive list of documented examples of reward exploitation.

2021-01-01
[██░░] FOUNDATIONAL

Optimal Policies Tend to Seek Power

Turner, A., et al.

Proves that for most reward functions in broad classes of environments, optimal policies tend to seek power by keeping options available.
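
Roughly, the paper's central quantity can be summarized as below (notation simplified; the formal definitions and theorem conditions in the paper are more careful):

```latex
% A state's POWER is the agent's average optimal value there (beyond the
% immediate reward), taken over a distribution D of reward functions.
\[
  \mathrm{POWER}_{\mathcal{D}}(s,\gamma)
  \;=\; \frac{1-\gamma}{\gamma}\,
        \mathbb{E}_{R \sim \mathcal{D}}\!\left[\, V^{*}_{R}(s,\gamma) - R(s) \,\right]
\]
% The main theorems show that, under symmetry conditions on the environment,
% optimal policies are more likely to steer toward high-POWER states: states
% that keep more options reachable (e.g., avoiding terminal or shutdown states).
```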

2021-05-01
[██░░] FOUNDATIONAL

Goal Misgeneralization in Deep Reinforcement Learning

Langosco, L., et al.

Central early reference for goal misgeneralization, where the learned goal differs from the intended goal under distribution shift.
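
A minimal caricature of the paper's CoinRun-style finding (the corridor environment and policies below are invented for illustration, not the authors' code):

```python
# During training the coin is always at the right end of a corridor, so
# "always go right" earns full reward. At test time the coin is placed on the
# left, and the proxy goal no longer coincides with the intended goal of
# reaching the coin, even though the reward specification never changed.
import random

def episode_return(policy, coin_pos: int, start: int = 5, horizon: int = 20) -> float:
    agent = start
    for _ in range(horizon):
        agent = policy(agent, coin_pos)
        if agent == coin_pos:
            return 1.0                      # intended reward: reach the coin
    return 0.0

go_right  = lambda pos, coin: min(pos + 1, 9)                   # learned proxy: "move right"
seek_coin = lambda pos, coin: pos + (1 if coin > pos else -1)   # intended goal: "reach the coin"

for name, policy in [("go right", go_right), ("seek coin", seek_coin)]:
    train = episode_return(policy, coin_pos=9)                                       # coin always on the right
    test  = sum(episode_return(policy, random.randint(0, 4)) for _ in range(100)) / 100
    print(f"{name:9s}  train return {train:.2f}   shifted-test return {test:.2f}")
# "go right" is optimal on the training distribution yet scores zero after the shift.
```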

2022-01-01
[██░░] FOUNDATIONAL

The Effects of Reward Misspecification: Mapping and Mitigating Misaligned Models

Pan, A., Bhatia, S., Steinhardt, J.

Shows that more capable agents exploit misspecified reward functions more effectively, with phase transitions producing sharp qualitative shifts in behavior.

2022-04-01
[██░░] FOUNDATIONAL

Goal Misgeneralization: Why Correct Specifications Aren't Enough For Correct Goals

Shah, R., Varma, V.

Even with a correct training objective, the learned goal can be wrong, coinciding with the true goal only on the training distribution.

2023-04-01
[██░░] FOUNDATIONAL

Power-Seeking Can Be Probable and Predictive for Trained Agents

Krakovna, V., Kramar, J.

Extends the power-seeking results from optimal policies to trained agents, showing that shutdown avoidance is probable under simplifying assumptions.