Affordances and permissions are promising and timely safety levers for mitigating Loss of Control (LoC) threats in high-stakes deployment contexts such as national security. Deployers in defense and intelligence could rely on several approaches to identify which affordances and permissions to prioritize, including structured threat modelling, pre-deployment agentic evaluations, post-deployment continuous monitoring, and AI safety cases. This paper proposes a complementary, empirical methodology that leverages existing use-case-specific benchmarks: backchaining LoC mitigations from the errors an AI system makes on national security benchmarks. The approach proceeds in three steps and allows national security deployers to start building LoC mitigations today, from evidence they can generate themselves. First, deployers evaluate AI systems on mission-specific benchmarks that approximate real use cases. Second, deployers concentrate on the AI system's incorrect responses to the benchmark questions and backchain the affordances and permissions that would enable the AI system to cause downstream harm if it pursued the actions described in those responses. Third, deployers intervene selectively on those affordances and permissions, bottlenecking the paths to harm while preserving the AI system's ability to carry out the correct action. We illustrate this methodology through a demonstrative benchmark question on derivative security classification.
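The three steps can be sketched in code as a rough illustration. The Python below is a minimal sketch under stated assumptions, not the paper's implementation: the `GradedResponse` record, the `backchain` function, and capability names such as `release_to_lower_domain` are all hypothetical, and the mapping from each response to the affordances and permissions its action would require is assumed to come from analyst annotation. The split between capabilities to restrict outright and capabilities to gate (because the correct action also needs them) is one possible reading of "intervening selectively".

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GradedResponse:
    """One graded benchmark response (hypothetical record layout)."""
    question_id: str
    is_correct: bool
    # Affordances/permissions the described action would need,
    # assumed here to be annotated by an analyst.
    capabilities: frozenset


def backchain(responses):
    """Return (restrict, gate) dicts mapping capability -> error count.

    restrict: capabilities used only by incorrect actions.
    gate: capabilities used by incorrect actions that correct actions
          also need, so they cannot simply be removed.
    """
    needed_for_correct = set()
    error_counts = {}
    for r in responses:
        if r.is_correct:
            needed_for_correct |= r.capabilities
        else:
            for cap in r.capabilities:
                error_counts[cap] = error_counts.get(cap, 0) + 1
    restrict = {c: n for c, n in error_counts.items()
                if c not in needed_for_correct}
    gate = {c: n for c, n in error_counts.items()
            if c in needed_for_correct}
    return restrict, gate


# Step 1: graded results from a mission-specific benchmark (illustrative data).
responses = [
    GradedResponse("q1", True,
                   frozenset({"read_source_document", "apply_marking"})),
    GradedResponse("q2", False,
                   frozenset({"apply_marking", "release_to_lower_domain"})),
    GradedResponse("q3", False,
                   frozenset({"release_to_lower_domain"})),
]

# Steps 2 and 3: backchain from the errors, then choose where to intervene.
restrict, gate = backchain(responses)
print("restrict:", restrict)
print("gate:", gate)
```

In this toy data, the hypothetical `release_to_lower_domain` permission appears only in incorrect responses and is therefore a candidate for removal, while `apply_marking` is also needed by a correct response and would call for a narrower control, such as human approval, rather than outright removal.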