From Defender to Devil? Unintended Risk Interactions Induced by LLM Defenses

Large Language Models (LLMs) have shown remarkable performance across various applications, but their deployment in real-world settings faces several risks, including jailbreak attacks and privacy leaks. To mitigate these risks, numerous defense strategies have been proposed. However, most existing studies assess these defenses in isolation and ignore their effects on other risk dimensions. In this work, we introduce a new cross-risk evaluation paradigm and take the first step in investigating unintended interactions among defenses in LLMs. Specifically, we focus on the interplay between safety, fairness, and privacy. To this end, we propose CrossRiskEval, a framework that systematically characterizes how a defense designed for one risk (e.g., safety) affects others (e.g., fairness or privacy). We conduct extensive empirical studies and mechanistic analyses on 14 LLMs with deployed defenses, covering 12 defense strategies. Our results show that defenses targeting a single risk often cause measurable effects on other risks. These effects vary in direction and magnitude across a range of factors (e.g., models, tasks, and defense strategies), and are often asymmetric across risk pairs. Furthermore, our mechanistic analysis shows that these interactions are not random: they arise from conflict-entangled neurons, which are shared internal representations that contribute in opposite ways to different risks. Adjusting one risk therefore perturbs these representations and leads to systematic changes in non-target risks. These findings reveal the limits of single-risk evaluation and highlight the need for holistic and interaction-aware assessment when designing and deploying LLM defenses.
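The two ideas in the abstract, measuring a defense's effect on non-target risks and flagging neurons that contribute in opposite ways to two risks, can be illustrated with a minimal sketch. This is not the CrossRiskEval implementation: the function names, the convention that a higher score means more risk, the sign-and-threshold rule for entanglement, and all numbers below are hypothetical placeholders chosen for illustration, not results from the paper.

```python
from typing import Dict, List, Sequence

# The three risk dimensions named in the abstract.
RISKS = ("safety", "fairness", "privacy")


def cross_risk_effects(
    base: Dict[str, int], defended: Dict[str, int], target: str
) -> Dict[str, int]:
    """Change in each NON-target risk score after a defense is applied.

    Assumption: scores are percentages where higher means more risk,
    so a positive delta means the defense made that risk worse.
    """
    return {r: defended[r] - base[r] for r in RISKS if r != target}


def conflict_entangled(
    attr_a: Sequence[float], attr_b: Sequence[float], threshold: float
) -> List[int]:
    """Indices of neurons whose attributions to two risks have opposite
    signs and whose magnitudes both reach `threshold`.

    Assumption: this sign-and-threshold rule is a simplification made up
    for this sketch; the paper's actual criterion is not given here.
    """
    return [
        i
        for i, (a, b) in enumerate(zip(attr_a, attr_b))
        if a * b < 0 and min(abs(a), abs(b)) >= threshold
    ]


# Hypothetical risk scores (percent) before and after a safety defense.
base = {"safety": 40, "fairness": 20, "privacy": 30}
defended = {"safety": 10, "fairness": 35, "privacy": 30}
effects = cross_risk_effects(base, defended, target="safety")
print(effects)

# Hypothetical per-neuron attribution scores for two risks.
attr_safety = [0.8, -0.1, 0.5, 0.02]
attr_fairness = [-0.6, -0.3, 0.4, -0.9]
print(conflict_entangled(attr_safety, attr_fairness, threshold=0.1))
```

In this toy setup the safety defense lowers the safety score but raises the fairness score, which is the kind of side effect a single-risk evaluation would miss. Running the same comparison with fairness as the target and safety as the non-target risk would show whether the effect is asymmetric.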
