Grounded but Misleading: Evaluating Semantic Alignment in AI-Generated Security Explanations

Online scams increasingly leverage fluent and context-aware social engineering strategies, creating growing demand for AI systems that explain why a message may be risky. However, explanations that cite detector-derived evidence may still semantically weaken or redirect the intended risk interpretation. We introduce VEXA: Verifying Semantic Explanation Alignment, a controlled testbed for studying the gap between lexical grounding and semantic risk alignment in AI-generated scam-risk explanations. VEXA generates ungrounded, risk-aligned, and risk-diluting explanations by independently controlling evidence grounding and semantic framing. Through LLM-as-a-judge and human evaluations, we show that explanations may continue to appear comparatively grounded even when their semantic interpretation weakens the detector's intended risk assessment. In human evaluation, risk-diluting XAI-grounded explanations retained comparatively elevated Perceived Evidence Grounding scores (3.66) despite lower Helpfulness (3.00) and Reasoning Support (3.14) scores. These findings provide controlled evidence of grounding illusion effects in AI-generated security explanations and suggest that trustworthy explanation evaluation must verify not only whether evidence is cited, but also how that evidence is interpreted.
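The gap between lexical grounding and semantic risk alignment described above can be illustrated with a minimal sketch. It is not VEXA's implementation: the `lexical_grounding` function, the evidence terms, and the three example explanations are all hypothetical. It only shows that a purely lexical check gives the same score to a risk-aligned and a risk-diluting explanation when both cite the same detector evidence.

```python
import re

def lexical_grounding(explanation: str, evidence_terms: list[str]) -> float:
    """Fraction of detector evidence terms that appear verbatim in the explanation.

    Hypothetical metric for illustration only; it ignores how the terms are framed.
    """
    if not evidence_terms:
        return 0.0
    words = set(re.findall(r"[a-z]+", explanation.lower()))
    hits = sum(1 for term in evidence_terms if term.lower() in words)
    return hits / len(evidence_terms)

# Hypothetical detector evidence for a suspicious message.
evidence = ["urgent", "verify", "account"]

# Hypothetical explanations for the three conditions named in the abstract.
explanations = {
    "ungrounded": "This message looks suspicious overall.",
    "risk_aligned": (
        "The words 'urgent', 'verify' and 'account' pressure you to act "
        "quickly, which is a common scam tactic."
    ),
    "risk_diluting": (
        "The words 'urgent', 'verify' and 'account' are typical of routine "
        "service notices."
    ),
}

scores = {name: lexical_grounding(text, evidence) for name, text in explanations.items()}
for name, score in scores.items():
    print(f"{name}: {score:.2f}")
```

In this toy setup the risk-aligned and risk-diluting explanations are indistinguishable by citation overlap alone, which is why the interpretation of the evidence needs a separate check (in the paper, LLM-as-a-judge and human ratings).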

