There is a growing need to gain insight into language model capabilities that relate to sensitive topics, such as bioterrorism or cyberwarfare. However, traditional open-source benchmarks are not fit for the task, because they conventionally publish the correct answers in human-readable form. At the same time, enforcing mandatory closed-quarters evaluations might stifle development and erode trust. In this context, we propose hashmarking, a protocol for evaluating language models in the open without having to disclose the correct answers. In its simplest form, a hashmark is a benchmark whose reference solutions have been cryptographically hashed prior to publication. Following an overview of the proposed evaluation protocol, we go on to assess its resilience against traditional attack vectors (e.g., rainbow table attacks), as well as against failure modes unique to increasingly capable generative models.
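To make the simplest form of a hashmark concrete, the following is a minimal sketch and not the protocol's actual specification. It assumes a salted, deliberately slow key-derivation function (PBKDF2 with SHA-256 from Python's standard library), a per-question random salt, and a simple whitespace and case normalization step; the function names `normalize`, `hash_answer`, `publish_entry`, and `is_correct`, as well as the iteration count, are hypothetical choices made for illustration.

```python
import hashlib
import hmac
import os


def normalize(answer: str) -> str:
    # Assumed normalization: lowercase and collapse whitespace, so that
    # trivially different spellings of the same answer hash identically.
    return " ".join(answer.lower().split())


def hash_answer(answer: str, salt: bytes, iterations: int = 100_000) -> str:
    # A slow, salted hash raises the cost of brute-force guessing and makes
    # precomputed (rainbow) tables useless across questions.
    digest = hashlib.pbkdf2_hmac(
        "sha256", normalize(answer).encode("utf-8"), salt, iterations
    )
    return digest.hex()


def publish_entry(question: str, reference_answer: str) -> dict:
    # Only the question, the salt, and the hash are published;
    # the plaintext reference answer never leaves the benchmark author.
    salt = os.urandom(16)
    return {
        "question": question,
        "salt": salt.hex(),
        "answer_hash": hash_answer(reference_answer, salt),
    }


def is_correct(entry: dict, candidate_answer: str) -> bool:
    # Anyone can score a model's answer by hashing it with the published
    # salt and comparing against the published hash.
    salt = bytes.fromhex(entry["salt"])
    candidate_hash = hash_answer(candidate_answer, salt)
    return hmac.compare_digest(candidate_hash, entry["answer_hash"])


if __name__ == "__main__":
    # Benign placeholder question, used purely for illustration.
    entry = publish_entry("What is the capital of France?", "Paris")
    print(is_correct(entry, "  paris "))
    print(is_correct(entry, "London"))
```

Under these assumptions, exact-match scoring only works when a correct answer can be normalized to a single canonical string, and the scheme is only as strong as the answer space is large: a short or low-entropy answer can still be recovered by exhaustive guessing, however slow the hash.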