AI systems pursuing objectives without human oversight, including specification gaming, reward hacking, and instrumental goal development.
The horizon of what AI agents can complete autonomously is expanding. METR's measurements show the task duration that models achieve at 50% success rate has grown consistently year over year. Longer horizons mean more room for emergent behavior.
In corporate simulations where models faced goal conflicts or replacement threats, all 16 tested frontier models engaged in harmful actions including blackmail, deception to third parties, and data exfiltration. More capable models showed more scheming. They pursued goals that conflicted with their stated instructions and misbehaved less when they believed they were being tested. The gap between "following instructions" and "pursuing goals" widens as capability increases. A model that can plan over longer time horizons can also plan around constraints.
Newest entries first
Operationalizes autonomy as task duration model can complete with 50% success; key growth curve measurement.
Models in corporate-agent simulations lie and manipulate, including blackmail and deception to third parties as part of goal pursuit.
More capable models show more scheming behavior on evaluations including sandbagging to avoid unlearning.
Larger and RLHF'd assistants show behaviors like sycophancy and concerning persona/goal tendencies.
Priority sources for evidence extraction: