Recent advances in prompt engineering enable large language models (LLMs) to solve multi-hop logical reasoning problems with impressive accuracy. However, little existing work investigates the robustness of LLMs under few-shot prompting techniques. We therefore introduce a systematic approach to testing the robustness of LLMs on multi-hop reasoning tasks via domain-agnostic perturbations. We include perturbations at multiple levels of abstraction (e.g., lexical perturbations such as typos, and semantic perturbations such as the inclusion of intermediate reasoning steps in the questions) to conduct behavioral analysis of the LLMs. Across our experiments, we find that models are more sensitive to certain perturbations, such as replacing words with their synonyms. We also demonstrate that increasing the proportion of perturbed exemplars in the prompts improves the robustness of few-shot prompting methods.
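To make the setup concrete, the following is a minimal sketch of one lexical perturbation and of a few-shot prompt in which a chosen fraction of exemplars is perturbed. It is an illustration under stated assumptions, not the procedure used in our experiments: the function names `swap_typo` and `build_prompt`, the adjacent-letter-swap typo model, the `rate` parameter, and the `Q:`/`A:` prompt layout are all hypothetical choices made for this example.

```python
import random


def swap_typo(text: str, rate: float, rng: random.Random) -> str:
    """Hypothetical typo model: swap two adjacent interior letters in some words."""
    words = text.split(" ")
    out = []
    for w in words:
        # Only touch purely alphabetic words long enough to have interior letters.
        if len(w) >= 4 and w.isalpha() and rng.random() < rate:
            i = rng.randrange(1, len(w) - 2)  # keep first and last letters fixed
            w = w[:i] + w[i + 1] + w[i] + w[i + 2:]
        out.append(w)
    return " ".join(out)


def build_prompt(exemplars, question, perturbed_fraction, rng, rate=0.3):
    """Assemble a few-shot prompt; perturb a given fraction of exemplar questions.

    exemplars: list of (question, answer) pairs (assumed format).
    """
    n_perturbed = round(perturbed_fraction * len(exemplars))
    perturbed_idx = set(rng.sample(range(len(exemplars)), n_perturbed))
    blocks = []
    for i, (q, a) in enumerate(exemplars):
        if i in perturbed_idx:
            q = swap_typo(q, rate, rng)
        blocks.append(f"Q: {q}\nA: {a}")
    # The test question is left unperturbed here; perturbing it is a separate step.
    blocks.append(f"Q: {question}\nA:")
    return "\n\n".join(blocks)


if __name__ == "__main__":
    rng = random.Random(0)
    exemplars = [
        ("Every wumpus is a mammal. Max is a wumpus. Is Max a mammal?", "Yes"),
        ("Every tumpus is a reptile. Rex is a tumpus. Is Rex a mammal?", "No"),
    ]
    print(build_prompt(exemplars, "Sam is a wumpus. Is Sam a mammal?", 0.5, rng))
```

Sweeping `perturbed_fraction` from 0 to 1 while evaluating on perturbed test questions is one way to probe the effect of the proportion of perturbed exemplars described above.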