Frontier artificial intelligence (AI) systems pose increasing risks to society, making it essential for developers to provide assurances about their safety. One approach to offering such assurances is a safety case: a structured, evidence-based argument aimed at demonstrating why the risk associated with a safety-critical system is acceptable. In this article, we propose a safety case template for offensive cyber capabilities. We illustrate how developers could argue that a model does not have capabilities posing unacceptable cyber risks by breaking down the main claim into progressively specific sub-claims, each supported by evidence. In our template, we identify a number of risk models, derive proxy tasks from those risk models, define evaluation settings for the proxy tasks, and connect the settings to evaluation results. Elements of current frontier safety techniques - such as risk models, proxy tasks, and capability evaluations - rely on implicit arguments for overall system safety. This safety case template integrates these elements using the Claims Arguments Evidence (CAE) framework to make safety arguments coherent and explicit. While uncertainties around the specifics remain, this template serves as a proof of concept, aiming to foster discussion on AI safety cases and advance AI assurance.
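As a rough illustration (not part of the paper itself), the decomposition described above can be modeled as a tree of claims, each supported by sub-claims and evidence. All class names, claim texts, and evaluation results below are hypothetical, chosen only to mirror the template's risk model → proxy task → evaluation setting → evaluation result chain:

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Evidence:
    """A concrete artifact backing a claim, e.g. an evaluation result."""
    description: str
    supports_claim: bool  # whether the evidence supports the parent claim


@dataclass
class Claim:
    """A safety claim, decomposed into sub-claims and/or backed by evidence."""
    statement: str
    sub_claims: List["Claim"] = field(default_factory=list)
    evidence: List[Evidence] = field(default_factory=list)

    def holds(self) -> bool:
        # A claim holds when all sub-claims hold and all evidence supports it.
        return (all(c.holds() for c in self.sub_claims)
                and all(e.supports_claim for e in self.evidence))


# Hypothetical instantiation of the template's claim hierarchy.
top = Claim(
    "The model does not have capabilities posing unacceptable cyber risks",
    sub_claims=[
        Claim(
            # Risk model: one specific pathway to harm.
            "The model cannot autonomously exploit known vulnerabilities",
            sub_claims=[
                Claim(
                    # Proxy task derived from the risk model.
                    "The model fails the exploitation proxy task suite",
                    evidence=[Evidence(
                        # Evaluation result from a defined evaluation setting.
                        "Solved 0/20 challenges in the sandboxed setting",
                        supports_claim=True,
                    )],
                ),
            ],
        ),
    ],
)

print(top.holds())  # True: every leaf claim is supported by its evidence
```

The recursion makes the argument structure explicit: the top-level claim is acceptable only if every sub-claim down to the evidence level is discharged, which is the coherence property the CAE framing is meant to provide.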