Backchaining Loss of Control Mitigations from Mission-Specific Benchmarks in National Security

Affordances and permissions are promising and timely safety levers for mitigating Loss of Control (LoC) threats in high-stakes deployment contexts such as national security. Deployers in defense and intelligence could rely on several approaches to identify which affordances and permissions to prioritize, including structured threat modelling, pre-deployment agentic evaluations, post-deployment continuous monitoring, and AI safety cases. This paper proposes a complementary, empirical methodology that leverages existing use-case-specific benchmarks: backchaining LoC mitigations from the errors an AI system makes on national security benchmarks. The approach lets national security deployers start building LoC mitigations today, from evidence they can generate themselves, and proceeds in three steps. First, deployers evaluate AI systems on mission-specific benchmarks that approximate real use-cases. Second, deployers focus on the AI system's incorrect responses to the benchmark questions and backchain to the affordances and permissions that would enable the system to cause downstream harm if it acted on those incorrect answers. Third, deployers intervene selectively on those affordances and permissions, bottlenecking the paths to harm while preserving the AI system's ability to carry out the correct action. We illustrate the methodology with a demonstrative benchmark question on derivative security classification.
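The three steps can be sketched in a few lines of Python. This is only an illustrative sketch, not the paper's implementation: the action names, the affordance names, and the `required` and `harm_enablers` mappings are all hypothetical, and in practice they would come from the deployer's own benchmark and threat analysis. The sketch separates affordances that can be restricted outright from those the correct action also needs, which are flagged for narrower controls or human review.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BenchmarkItem:
    """One benchmark question with the expected and observed action (hypothetical schema)."""
    question: str
    correct_action: str
    model_action: str


def backchain(items, required, harm_enablers):
    """Return (restrict, review) sets of affordances.

    required:      action -> affordances needed to carry out that action
    harm_enablers: incorrect action -> affordances that would let it cause downstream harm
    """
    needed = set()       # affordances the correct actions depend on
    harm_path = set()    # affordances on a path from an observed error to harm
    for item in items:
        needed |= required[item.correct_action]
        if item.model_action != item.correct_action:  # step 2: focus on errors only
            harm_path |= harm_enablers.get(item.model_action, set())
    restrict = harm_path - needed   # step 3: safe to remove, correct action unaffected
    review = harm_path & needed     # shared with the correct action, needs a finer control
    return restrict, review


# Step 1 (assumed already run): results on a toy derivative-classification benchmark.
items = [
    BenchmarkItem("Q1", correct_action="label_secret", model_action="label_unclassified"),
    BenchmarkItem("Q2", correct_action="label_secret", model_action="label_secret"),
]
required = {
    "label_secret": {"read_source", "write_label"},
    "label_unclassified": {"read_source", "write_label"},
}
harm_enablers = {
    # Assumption: an under-classified document only causes harm if it can be labelled and sent out.
    "label_unclassified": {"write_label", "send_external"},
}

restrict, review = backchain(items, required, harm_enablers)
print("restrict:", sorted(restrict))
print("review:", sorted(review))
```

In this toy case the external-send permission is restricted, because no correct action depends on it, while the label-writing permission is only flagged for review, because removing it would also block the correct classification.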



