Online scams increasingly leverage fluent and context-aware social engineering strategies, creating growing demand for AI systems that explain why a message may be risky. However, explanations that cite detector-derived evidence may still semantically weaken or redirect the intended risk interpretation. We introduce VEXA: Verifying Semantic Explanation Alignment, a controlled testbed for studying the gap between lexical grounding and semantic risk alignment in AI-generated scam-risk explanations. VEXA generates ungrounded, risk-aligned, and risk-diluting explanations by independently controlling evidence grounding and semantic framing. Through LLM-as-a-judge and human evaluations, we show that explanations may continue to appear comparatively grounded even when their semantic interpretation weakens the detector's intended risk assessment. In human evaluation, risk-diluting XAI-grounded explanations retained comparatively elevated Perceived Evidence Grounding scores (3.66) despite lower Helpfulness (3.00) and Reasoning Support (3.14) scores. These findings provide controlled evidence of grounding illusion effects in AI-generated security explanations and suggest that trustworthy explanation evaluation must verify not only whether evidence is cited, but also how that evidence is interpreted.