Following the rapid increase in Artificial Intelligence (AI) capabilities in recent years, the AI community has voiced concerns regarding possible safety risks. To support decision-making on the safe use and development of AI systems, there is a growing need for high-quality evaluations of dangerous model capabilities. While several attempts to provide such evaluations have been made, a clear definition of what constitutes a "good evaluation" has yet to be agreed upon. In this practitioners' perspective paper, we present a set of best practices for safety evaluations, drawing on prior work in model evaluation and illustrated through cybersecurity examples. We first discuss the steps of the initial thought process, which connects threat modeling to evaluation design. Then, we provide the characteristics and parameters that make an evaluation useful. Finally, we address additional considerations as we move from building specific evaluations to building a full and comprehensive evaluation suite.