There is a growing need to gain insight into language model capabilities that relate to sensitive topics, such as bioterrorism or cyberwarfare. However, traditional open-source benchmarks are not fit for the task, because they are customarily published with the correct answers in human-readable form. At the same time, enforcing mandatory closed-quarters evaluations might stifle development and erode trust. In this context, we propose hashmarking, a protocol for evaluating language models in the open without having to disclose the correct answers. In its simplest form, a hashmark is a benchmark whose reference solutions have been cryptographically hashed prior to publication. Following an overview of the proposed evaluation protocol, we assess its resilience against traditional attack vectors (e.g., rainbow table attacks), as well as against failure modes unique to increasingly capable generative models.
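The simplest form of a hashmark described above can be illustrated with a minimal sketch. It is not the protocol as specified in the paper. The function names (`normalize`, `hash_answer`, `publish`, `check`), the entry layout, the answer normalization, the per-question random salt, and the use of PBKDF2-HMAC-SHA256 with 100,000 iterations are all assumptions made for illustration. A salted, deliberately slow hash is one common way to raise the cost of precomputed (rainbow table) attacks.

```python
import hashlib
import hmac
import os
import unicodedata


def normalize(answer: str) -> str:
    """Canonicalize an answer so trivially different spellings hash alike.

    Assumption: Unicode NFKC, case folding, and whitespace collapsing.
    """
    folded = unicodedata.normalize("NFKC", answer).casefold()
    return " ".join(folded.split())


def hash_answer(answer: str, salt: bytes, iterations: int = 100_000) -> str:
    """Return a hex digest of the normalized answer (assumed scheme: PBKDF2)."""
    digest = hashlib.pbkdf2_hmac(
        "sha256", normalize(answer).encode("utf-8"), salt, iterations
    )
    return digest.hex()


def publish(question: str, reference: str) -> dict:
    """Build a publishable entry holding the salt and hash, not the answer."""
    salt = os.urandom(16)  # fresh salt per question
    return {
        "question": question,
        "salt": salt.hex(),
        "hash": hash_answer(reference, salt),
    }


def check(entry: dict, candidate: str) -> bool:
    """Score a candidate answer by re-hashing it with the published salt."""
    salt = bytes.fromhex(entry["salt"])
    return hmac.compare_digest(hash_answer(candidate, salt), entry["hash"])


if __name__ == "__main__":
    entry = publish("What is the capital of France?", "Paris")
    print(check(entry, "  paris "))  # True: matches after normalization
    print(check(entry, "Lyon"))      # False
```

In this sketch, exact matching after normalization is the only scoring rule, so a correct answer phrased differently from the reference would be marked wrong. How answers are canonicalized is therefore a design choice that any real hashmark would need to settle.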