Most frontier AI developers publicly document their safety evaluations of new AI models in model reports, including testing for chemical and biological (ChemBio) misuse risks. This practice provides a window into the methodology of these evaluations, helping to build public trust in AI systems and enabling third-party review in the still-emerging science of AI evaluation. But what aspects of evaluation methodology do developers currently include -- or omit -- in their reports? This paper examines three frontier AI model reports published in spring 2025 with among the most detailed documentation: OpenAI's o3, Anthropic's Claude 4, and Google DeepMind's Gemini 2.5 Pro. We compare these using the STREAM (v1) standard for reporting ChemBio benchmark evaluations. Each model report included some useful details that the others did not, and all model reports were found to have areas for development, suggesting that developers could benefit from adopting one another's best reporting practices. We identified several items where reporting was less well-developed across all model reports, such as providing examples of test material and including a detailed list of elicitation conditions. Overall, we recommend that AI developers continue to strengthen the emerging science of evaluation by working towards greater transparency in areas where reporting currently remains limited.