The robustness of large language models (LLMs) is becoming increasingly important as their use grows rapidly across a wide range of domains. Retrieval-Augmented Generation (RAG) is considered a means of improving the trustworthiness of text generated by LLMs. However, how the outputs of RAG-based LLMs are affected by slightly different inputs is not well studied. In this work, we find that inserting even a short prefix into the prompt leads to outputs that are far from the factually correct answers. We systematically evaluate the effect of such prefixes on RAG by introducing a novel optimization technique called Gradient Guided Prompt Perturbation (GGPP). GGPP achieves a high success rate in steering the outputs of RAG-based LLMs toward targeted wrong answers. It can also cope with instructions in the prompt that ask the model to ignore irrelevant context. We also exploit the difference in LLMs' neuron activations between prompts with and without GGPP perturbations to improve the robustness of RAG-based LLMs, using a highly effective detector trained on the neuron activations triggered by GGPP-generated prompts. Our evaluation on open-source LLMs demonstrates the effectiveness of our methods.