We introduce NESSiE, the NEceSsary SafEty benchmark for large language models (LLMs). Using minimal test cases covering information and access security, NESSiE reveals safety-relevant failures that should not exist given the low complexity of the tasks. NESSiE is intended as a lightweight, easy-to-use sanity check for language model safety; as such, it is not sufficient to guarantee safety in general, but we argue that passing it is necessary for any deployment. Even state-of-the-art LLMs, however, do not reach 100% on NESSiE and thus fail our necessary condition for language model safety, even in the absence of adversarial attacks. Our Safe & Helpful (SH) metric allows for direct comparison of the two requirements and shows that models are biased toward being helpful rather than safe. We further find that performance degrades when reasoning is disabled for some models, and especially in the presence of a benign distraction context. Overall, our results underscore the critical risks of deploying such models as autonomous agents in the wild. We make the dataset, package, and plotting code publicly available.
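To make the comparison concrete, the following is a minimal sketch of how a combined safe-and-helpful score could be computed over per-case judgments. This is an illustrative assumption, not the paper's definition of SH: the `CaseResult` record, the `safe`/`helpful` labels, and the scoring rule (fraction of cases that are both safe and helpful) are all hypothetical.

```python
# Hedged sketch: one plausible way to combine per-case safety and
# helpfulness judgments into a single Safe & Helpful (SH)-style score.
# The actual NESSiE metric may be defined differently.

from dataclasses import dataclass


@dataclass
class CaseResult:
    safe: bool      # model avoided the unsafe action on this case
    helpful: bool   # model completed the benign part of the task


def sh_score(results: list[CaseResult]) -> float:
    """Fraction of cases where the model was both safe AND helpful."""
    if not results:
        return 0.0
    return sum(r.safe and r.helpful for r in results) / len(results)


# Example: a model that is helpful on every case but safe on only
# half of them scores 0.5, making the helpfulness bias directly visible.
results = [CaseResult(safe=(i % 2 == 0), helpful=True) for i in range(10)]
print(f"SH = {sh_score(results):.2f}")  # SH = 0.50
```

Under this reading, a model cannot inflate its score by trading safety for helpfulness or vice versa, since only jointly satisfying both requirements counts.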