Large language models (LLMs) are now an integral part of information retrieval. Their role as question-answering chatbots therefore raises significant concerns, given their demonstrated vulnerability to adversarial man-in-the-middle (MitM) attacks. Here, we propose the first principled attack evaluation of LLM factual memory under prompt injection via Xmera, our novel, theory-grounded MitM framework. By perturbing the input given to "victim" LLMs in three closed-book, fact-based QA settings, we undermine the correctness of their responses and assess the uncertainty of their generation process. Surprisingly, trivial instruction-based attacks achieve the highest success rate (up to ~85.3%) while simultaneously exhibiting high uncertainty on incorrectly answered questions. To provide a simple defense mechanism against Xmera, we train Random Forest classifiers on response uncertainty levels to distinguish attacked from unattacked queries (average AUC of up to ~94.8%). We believe that signaling users to be cautious about the answers they receive from black-box and potentially corrupt LLMs is a first checkpoint toward user cyberspace safety.
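The idea of detecting attacked queries from response uncertainty can be illustrated with a minimal sketch. It is not the Xmera pipeline: it swaps the Random Forest classifier for a single-feature threshold detector on mean token entropy, and it runs on synthetic next-token distributions rather than real LLM outputs. The helper names (`token_entropy`, `mean_entropy`, `synthetic_response`, `auc`), the confidence ranges, and the assumption that attacked responses are less confident on average are all hypothetical choices made for illustration.

```python
import math
import random


def token_entropy(probs):
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def mean_entropy(distributions):
    """Average per-token entropy over one generated response."""
    return sum(token_entropy(d) for d in distributions) / len(distributions)


def synthetic_response(rng, low, high, n_tokens=5, vocab=4):
    """Hypothetical stand-in for an LLM response: each token gets a
    distribution whose top probability is drawn from [low, high] and
    whose remaining mass is split evenly over the other tokens."""
    dists = []
    for _ in range(n_tokens):
        top = rng.uniform(low, high)
        rest = (1.0 - top) / (vocab - 1)
        dists.append([top] + [rest] * (vocab - 1))
    return dists


def auc(scores_pos, scores_neg):
    """Rank-based AUC: probability that a random positive scores above
    a random negative, with ties counted as one half."""
    wins = 0.0
    for p in scores_pos:
        for n in scores_neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(scores_pos) * len(scores_neg))


# Assumption for this toy data only: attacked responses are less confident.
rng = random.Random(0)
clean = [mean_entropy(synthetic_response(rng, 0.80, 0.99)) for _ in range(200)]
attacked = [mean_entropy(synthetic_response(rng, 0.40, 0.85)) for _ in range(200)]

# Treat "attacked" as the positive class and mean entropy as the score.
detector_auc = auc(attacked, clean)
print(f"AUC of the entropy-threshold detector: {detector_auc:.3f}")
```

A classifier over several uncertainty features, such as the Random Forest described above, would replace the single entropy score here; the rank-based AUC computation stays the same for any scalar detector output.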