Most adversarial threats in artificial intelligence target the computational behavior of models rather than the humans who rely on them. Yet modern AI systems increasingly operate within human decision loops, where users interpret and act on model recommendations. Large Language Models (LLMs) generate fluent natural-language explanations that shape how users perceive and trust AI outputs, exposing a new attack surface at the cognitive layer: the communication channel between the AI and its users. We introduce adversarial explanation attacks (AEAs), in which an attacker manipulates the framing of LLM-generated explanations to modulate human trust in incorrect outputs. We formalize this behavioral threat through the trust miscalibration gap, a metric that captures the difference in human trust between correct and incorrect outputs under adversarial explanations. This gap isolates the most concerning threat: persuasive explanations that reinforce users' trust in incorrect predictions. To characterize this threat, we conducted a controlled experiment (n = 205), systematically varying four dimensions of explanation framing: reasoning mode, evidence type, communication style, and presentation format. Our findings show that users report nearly identical trust for adversarial and benign explanations, with adversarial explanations preserving the vast majority of benign trust despite being incorrect. The most vulnerable cases arise when AEAs closely resemble expert communication, combining authoritative evidence, a neutral tone, and domain-appropriate reasoning. Vulnerability is highest on hard tasks, in fact-driven domains, and among participants who are less formally educated, younger, or highly trusting of AI. To our knowledge, this is the first systematic security study that treats explanations as an adversarial cognitive channel and quantifies their impact on human trust in AI-assisted decision making.