Large Language Models (LLMs) are increasingly deployed in high-risk domains. However, state-of-the-art LLMs often exhibit hallucinations, raising serious concerns about their reliability. Prior work has explored adversarial attacks to elicit hallucinations in LLMs, but these methods often rely on unrealistic prompts, either inserting nonsensical tokens or altering the original semantic intent. Consequently, such approaches offer limited insight into how hallucinations arise in real-world settings. In contrast, adversarial attacks in computer vision typically involve realistic modifications to input images. The problem of identifying realistic adversarial prompts for eliciting LLM hallucinations, however, remains largely underexplored. To address this gap, we propose Semantically Equivalent and Coherent Attacks (SECA), which elicit hallucinations via realistic prompt modifications that preserve meaning while maintaining semantic coherence. Our contributions are threefold: (i) we formulate the search for realistic hallucination-eliciting attacks as a constrained optimization problem over the input prompt space, subject to semantic equivalence and coherence constraints; (ii) we introduce a constraint-preserving zeroth-order method to effectively search for adversarial yet feasible prompts; and (iii) we demonstrate through experiments on open-ended multiple-choice question answering tasks that SECA achieves higher attack success rates than existing methods while incurring almost no semantic equivalence or coherence errors. SECA highlights the sensitivity of both open-source and gradient-inaccessible commercial LLMs to realistic and plausible prompt variations. Code is available at https://github.com/Buyun-Liang/SECA.
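The abstract's constraint-preserving zeroth-order search can be illustrated with a minimal sketch: propose candidate rewrites of the prompt, discard any that violate the equivalence/coherence constraints, and keep a candidate only when a black-box query to the target model raises the attack objective. Everything below is hypothetical scaffolding, not the authors' implementation — `propose_edits`, `is_feasible`, and the toy scoring model are placeholders for an LLM-based paraphraser, learned constraint checkers, and a real target model, respectively.

```python
def propose_edits(prompt):
    """Placeholder paraphrase generator.

    In practice, an LLM would propose semantically equivalent rewrites
    of the whole prompt; here we just swap a few hand-picked synonyms.
    """
    synonyms = {"picture": "image", "show": "display", "choose": "select"}
    words = prompt.split()
    candidates = []
    for i, w in enumerate(words):
        if w in synonyms:
            candidates.append(" ".join(words[:i] + [synonyms[w]] + words[i + 1:]))
    return candidates

def is_feasible(candidate, original):
    """Placeholder constraint check.

    A real SECA-style check would test semantic equivalence to the
    original prompt and fluency/coherence of the candidate; this toy
    version only verifies the rewrite did not change the length.
    """
    return len(candidate.split()) == len(original.split())

def zeroth_order_search(prompt, attack_score, iters=20):
    """Greedy zeroth-order search: no gradients, only black-box queries.

    `attack_score(prompt)` queries the target model and returns a scalar
    measuring how strongly the response deviates from the correct answer.
    """
    best, best_score = prompt, attack_score(prompt)
    for _ in range(iters):
        improved = False
        # Only constraint-satisfying (feasible) candidates are ever scored.
        for cand in (c for c in propose_edits(best) if is_feasible(c, prompt)):
            s = attack_score(cand)
            if s > best_score:
                best, best_score, improved = cand, s, True
        if not improved:  # local optimum under the proposal distribution
            break
    return best, best_score
```

With a toy objective that rewards the word "image", `zeroth_order_search("show the picture to me", lambda p: 1.0 if "image" in p else 0.0)` climbs to a feasible paraphrase containing "image" without ever touching model internals, which is the sense in which the method applies to gradient-inaccessible commercial LLMs.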