The robustness of large language models (LLMs) is becoming increasingly important as their use grows rapidly across a wide range of domains. Retrieval-Augmented Generation (RAG) is regarded as a means of improving the trustworthiness of text generated by LLMs. However, how the outputs of RAG-based LLMs are affected by slightly different inputs is not well studied. In this work, we find that inserting even a short prefix into the prompt leads to outputs far from the factually correct answers. We systematically evaluate the effect of such prefixes on RAG by introducing a novel optimization technique called Gradient Guided Prompt Perturbation (GGPP). GGPP achieves a high success rate in steering the outputs of RAG-based LLMs toward targeted wrong answers. It also remains effective when the prompt contains instructions to ignore irrelevant context. In addition, we exploit the difference in LLMs' neuron activations between prompts with and without GGPP perturbations to improve the robustness of RAG-based LLMs: a highly effective detector is trained on the neuron activations triggered by GGPP-generated prompts. Our evaluation on open-source LLMs demonstrates the effectiveness of our methods.
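To make the idea of a gradient-guided prefix search concrete, the following is a minimal sketch on a toy retriever. It is not the GGPP algorithm itself, whose details the abstract does not give. Everything in it is an assumption made for illustration: the `encode` function (tanh of the mean token embedding), the `objective` (similarity to a target passage minus similarity to the original passage), the random embedding table, and the hyperparameters `prefix_len`, `top_k`, and `sweeps`. The sketch uses the gradient of the objective to shortlist replacement tokens for each prefix position, then keeps a replacement only if it actually improves the objective.

```python
import math
import random


def encode(token_ids, emb):
    """Toy query encoder (assumption): tanh of the mean token embedding."""
    dim = len(emb[0])
    mean = [sum(emb[t][d] for t in token_ids) / len(token_ids) for d in range(dim)]
    return [math.tanh(x) for x in mean]


def objective(token_ids, emb, target, original):
    """Dot-product similarity to the target passage minus that to the original one."""
    q = encode(token_ids, emb)
    return sum(qd * (td - od) for qd, td, od in zip(q, target, original))


def perturb_prefix(query_ids, emb, target, original, prefix_len=3, top_k=5, sweeps=3):
    """Greedy, gradient-guided token substitution over a short prefix."""
    prefix = [0] * prefix_len  # arbitrary starting tokens
    n = prefix_len + len(query_ids)
    for _ in range(sweeps):
        for pos in range(prefix_len):
            ids = prefix + list(query_ids)
            q = encode(ids, emb)
            # Gradient of the objective w.r.t. one token embedding
            # (chain rule through tanh and the mean over n tokens).
            grad = [(1.0 - qd * qd) * (td - od) / n
                    for qd, td, od in zip(q, target, original)]
            cur = emb[prefix[pos]]
            # First-order estimate of the gain from swapping in each vocabulary token.
            gain = [sum((row[d] - cur[d]) * grad[d] for d in range(len(grad)))
                    for row in emb]
            candidates = sorted(range(len(emb)), key=gain.__getitem__, reverse=True)[:top_k]
            # Re-score the shortlisted tokens exactly; accept only real improvements.
            best_tok, best_val = prefix[pos], objective(ids, emb, target, original)
            for tok in candidates:
                trial = prefix[:pos] + [tok] + prefix[pos + 1:]
                val = objective(trial + list(query_ids), emb, target, original)
                if val > best_val:
                    best_tok, best_val = tok, val
            prefix[pos] = best_tok
    return prefix


# Toy data (all synthetic): 50-token vocabulary, 8-dimensional embeddings.
rng = random.Random(0)
emb = [[rng.gauss(0, 1) for _ in range(8)] for _ in range(50)]
target = [rng.gauss(0, 1) for _ in range(8)]
original = [rng.gauss(0, 1) for _ in range(8)]
query_ids = [3, 17, 42, 8]

prefix = perturb_prefix(query_ids, emb, target, original)
before = objective([0, 0, 0] + query_ids, emb, target, original)
after = objective(prefix + query_ids, emb, target, original)
print("objective before:", before, "after:", after)
```

Because a substitution is accepted only when the exact objective increases, the search never makes the prefix worse than its starting point. A real attack would operate on an actual retriever and tokenizer rather than this synthetic embedding table.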