Ensuring the safety and alignment of Large Language Models (LLMs) is a pressing challenge as they are increasingly integrated into critical applications and societal functions. While prior research has primarily focused on jailbreak attacks, less attention has been given to non-adversarial failures that subtly emerge during benign interactions. We introduce secondary risks, a novel class of failure modes marked by harmful or misleading behaviors in response to benign prompts. Unlike adversarial attacks, these risks stem from imperfect generalization and often evade standard safety mechanisms. To enable systematic evaluation, we define two risk primitives, verbose response and speculative advice, that capture these core failure patterns. Building on these definitions, we propose SecLens, a black-box, multi-objective search framework that efficiently elicits secondary risk behaviors by optimizing task relevance, risk activation, and linguistic plausibility. To support reproducible evaluation, we release SecRiskBench, a benchmark dataset of 650 prompts covering eight diverse real-world risk categories. Extensive evaluations on 16 popular models demonstrate that secondary risks are widespread, transferable across models, and modality-independent, underscoring the urgent need for enhanced safety mechanisms to address benign yet harmful LLM behaviors in real-world deployments.
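For intuition, the sketch below illustrates the kind of multi-objective scoring loop a black-box search like SecLens might use: each candidate prompt is scored on task relevance, risk activation, and linguistic plausibility, and the best candidates survive to the next mutation round. The scorer interfaces, weights, and the `query_model`, `mutate`, and `search_step` names are illustrative assumptions, not the paper's actual implementation.

```python
# A minimal sketch of multi-objective black-box prompt search, under assumed
# interfaces. All function names and weights here are hypothetical.
from dataclasses import dataclass
from typing import Callable, List, Tuple

@dataclass
class Candidate:
    prompt: str
    score: float = 0.0

def combined_score(
    prompt: str,
    response: str,
    task_relevance: Callable[[str, str], float],  # stays on-task for the benign request
    risk_activation: Callable[[str], float],      # response exhibits the targeted risk primitive
    plausibility: Callable[[str], float],         # prompt reads like a natural user query
    weights: Tuple[float, float, float] = (1.0, 1.0, 1.0),
) -> float:
    """Weighted sum of the three objectives; each scorer returns a value in [0, 1]."""
    w_rel, w_risk, w_plaus = weights
    return (
        w_rel * task_relevance(prompt, response)
        + w_risk * risk_activation(response)
        + w_plaus * plausibility(prompt)
    )

def search_step(
    pool: List[Candidate],
    mutate: Callable[[str], str],       # e.g., paraphrase or token-level edit
    query_model: Callable[[str], str],  # black-box access: prompt -> response
    scorers,                            # (task_relevance, risk_activation, plausibility)
    keep: int = 8,
) -> List[Candidate]:
    """One black-box iteration: mutate survivors, score via model queries, keep the best."""
    children = [Candidate(mutate(c.prompt)) for c in pool]
    for cand in pool + children:
        response = query_model(cand.prompt)
        cand.score = combined_score(cand.prompt, response, *scorers)
    return sorted(pool + children, key=lambda c: c.score, reverse=True)[:keep]
```

Because only `query_model` touches the target system, this style of search needs no gradients or internal access, which is what makes the black-box setting described in the abstract feasible.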