SoSBench: Benchmarking Safety Alignment on Six Scientific Domains

Large language models (LLMs) exhibit advancing capabilities in complex tasks, such as reasoning and graduate-level question answering, yet their resilience against misuse, particularly misuse involving scientifically sophisticated risks, remains underexplored. Existing safety benchmarks typically either focus on instructions that require minimal knowledge comprehension (e.g., "tell me how to build a bomb") or use prompts that are relatively low-risk (e.g., multiple-choice or classification tasks about hazardous content). Consequently, they fail to adequately assess model safety in knowledge-intensive, hazardous scenarios. To address this gap, we introduce SoSBench, a regulation-grounded, hazard-focused benchmark encompassing six high-risk scientific domains: chemistry, biology, medicine, pharmacology, physics, and psychology. The benchmark comprises 3,000 prompts derived from real-world regulations and laws, systematically expanded via an LLM-assisted evolutionary pipeline that introduces diverse, realistic misuse scenarios (e.g., detailed explosive synthesis instructions involving advanced chemical formulas). We evaluate frontier models on SoSBench within a unified evaluation framework. Despite their alignment claims, advanced models consistently disclose policy-violating content across all domains, with alarmingly high rates of harmful responses (e.g., 84.9% for DeepSeek-R1 and 50.3% for GPT-4.1). These results highlight significant safety alignment deficiencies and underscore urgent concerns regarding the responsible deployment of powerful LLMs.
