We present a quantitative model for tracking dangerous AI capabilities over time. Our goal is to help the policy and research community visualise how dangerous capability testing can give us an early warning about approaching AI risks. We first use the model to provide a novel introduction to dangerous capability testing and to show how this testing can directly inform policy. Decision makers in AI labs and government often set policy that is sensitive to the estimated danger of AI systems, and may wish to set policies that trigger once a set danger threshold is crossed. The model helps us reason about these policy choices. We then run simulations to illustrate how we might fail to test for dangerous capabilities. To summarise, failures in dangerous capability testing may manifest in two ways: higher bias in our estimates of AI danger, or larger lags in threshold monitoring. We highlight two drivers of these failure modes: uncertainty around dynamics in AI capabilities, and competition between frontier AI labs. Effective AI policy demands that we address these failure modes and their drivers. Even when optimally targeting testing resources is challenging, we show how delays in testing can harm AI policy. We offer preliminary recommendations for building an effective testing ecosystem for dangerous capabilities, and advise on a research agenda.