Recent research showing that AI systems can exhibit deception and shutdown resistance suggests that AI loss of control (LOC) is an urgent policy concern, yet the current literature focuses almost exclusively on alignment and prevention. To address this gap, this paper introduces a foundational framework and taxonomy for managing catastrophic AI LOC incidents. The taxonomy's first level distinguishes scenarios in which regaining control is 'extremely costly' from those in which it is 'impossible'. Impossible scenarios demand immediate resilience investments that fundamentally restrict an AI's attack surface, whereas extremely costly scenarios require active incident management through Containment and Threat Neutralization. The framework further divides these manageable events into accidental LOC (requiring automated circuit-breaker responses) and adversarial LOC (requiring graduated escalatory measures). By mapping three severity classes to specific scenario matrices, the paper provides a concrete, proportional guide for managing unprecedented AI risks.
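The two taxonomy levels described above can be read as a small decision procedure. The following is a minimal sketch under stated assumptions: the names `Recoverability`, `Origin`, and `recommended_response` are hypothetical and introduced only for illustration, and the returned strings paraphrase the abstract rather than quote the paper's own definitions. The three severity classes are left out because the abstract does not name them.

```python
from enum import Enum


class Recoverability(Enum):
    """First taxonomy level: how hard it is to regain control (hypothetical names)."""
    EXTREMELY_COSTLY = "extremely costly"
    IMPOSSIBLE = "impossible"


class Origin(Enum):
    """Second level, applied only to manageable (extremely costly) events."""
    ACCIDENTAL = "accidental"
    ADVERSARIAL = "adversarial"


def recommended_response(recoverability, origin=None):
    """Map a scenario to the response category the abstract associates with it."""
    if recoverability is Recoverability.IMPOSSIBLE:
        # No incident management is possible, so the emphasis is on
        # resilience investments made ahead of time.
        return "resilience investment: restrict the AI's attack surface"
    if origin is Origin.ACCIDENTAL:
        return "containment and threat neutralization: automated circuit-breaker response"
    if origin is Origin.ADVERSARIAL:
        return "containment and threat neutralization: graduated escalatory measures"
    # Assumption of this sketch: manageable events must be labeled by origin.
    raise ValueError("origin is required for 'extremely costly' scenarios")


if __name__ == "__main__":
    print(recommended_response(Recoverability.IMPOSSIBLE))
    print(recommended_response(Recoverability.EXTREMELY_COSTLY, Origin.ACCIDENTAL))
    print(recommended_response(Recoverability.EXTREMELY_COSTLY, Origin.ADVERSARIAL))
```

The sketch treats the accidental/adversarial split as applying only to 'extremely costly' scenarios, which matches the abstract's wording that only those events are subject to active incident management.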