Frontier AI companies increasingly rely on external evaluations to assess risks from dangerous capabilities before deployment. However, external evaluators often receive limited model access, limited information, and little time, which can reduce evaluation rigour and confidence. The EU General-Purpose AI Code of Practice calls for "appropriate access", but does not specify what this means in practice. Furthermore, there is no common framework for describing different types and levels of evaluator access. To address this gap, we propose a taxonomy of access methods for dangerous capability evaluations. We disentangle three aspects of access: model access, model information, and evaluation timeframe. For each aspect, we review benefits and risks, including how expanding access can reduce false negatives and improve stakeholder trust, but can also increase security and capacity challenges. We argue that these limitations can likely be mitigated through technical means and safeguards used in other industries. Based on the taxonomy, we propose three descriptive access levels: AL1 (black-box model access and minimal information), AL2 (grey-box model access and substantial information), and AL3 (white-box model access and comprehensive information), to support clearer communication between evaluators, frontier AI companies, and policymakers. We believe these levels correspond to the different standards for appropriate access defined in the EU Code of Practice, though these standards may change over time.