Large language models have achieved impressive performance on a variety of natural language processing tasks. However, they have so far been evaluated primarily on benchmarks where all information in the input context is relevant to solving the task. In this work, we investigate the distractibility of large language models, i.e., how irrelevant context can influence a model's problem-solving accuracy. In particular, we introduce Grade-School Math with Irrelevant Context (GSM-IC), an arithmetic reasoning dataset with irrelevant information in the problem description. We use this benchmark to measure the distractibility of cutting-edge prompting techniques for large language models, and find that model performance decreases dramatically when irrelevant information is included. We also identify several approaches for mitigating this deficiency, such as decoding with self-consistency and adding to the prompt an instruction that tells the language model to ignore the irrelevant information.
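The two mitigations named in the abstract, an instruction to ignore irrelevant information and self-consistency decoding, can be illustrated with a minimal Python sketch. Everything named here is an assumption made for illustration: the instruction wording in `IGNORE_INSTRUCTION`, the `Q:`/`A:` prompt layout, the helper names `build_prompt` and `self_consistent_answer`, and the `sample_answer` callable that stands in for a sampled model call returning a final answer string. The example problem is hypothetical and not taken from GSM-IC.

```python
from collections import Counter
from itertools import cycle
from typing import Callable

# Hypothetical instruction wording; the abstract does not give the exact text.
IGNORE_INSTRUCTION = (
    "Some details in the problem may be irrelevant; ignore them when solving."
)


def build_prompt(problem: str, add_instruction: bool = True) -> str:
    """Build a prompt, optionally prefixed with the ignore-irrelevant-info instruction."""
    parts = []
    if add_instruction:
        parts.append(IGNORE_INSTRUCTION)
    parts.append(f"Q: {problem}")
    parts.append("A:")
    return "\n".join(parts)


def self_consistent_answer(
    sample_answer: Callable[[str], str],
    prompt: str,
    n_samples: int = 5,
) -> str:
    """Sample several final answers and return the most frequent one (majority vote).

    `sample_answer` is an assumed stand-in for a stochastic model call that
    returns only the final answer as a string.
    """
    if n_samples < 1:
        raise ValueError("n_samples must be at least 1")
    answers = [sample_answer(prompt).strip() for _ in range(n_samples)]
    return Counter(answers).most_common(1)[0][0]


if __name__ == "__main__":
    # Deterministic fake sampler so the sketch runs without any model.
    fake_outputs = cycle(["18", "21", "18", "18", "21"])

    def fake_sampler(prompt: str) -> str:
        return next(fake_outputs)

    # Hypothetical problem; the brother's age is the irrelevant sentence.
    prompt = build_prompt(
        "Lucy has 12 apples and buys 6 more. Her brother is 9 years old. "
        "How many apples does Lucy have?"
    )
    print(self_consistent_answer(fake_sampler, prompt, n_samples=5))  # prints 18
```

The majority vote only helps when the distracted answers disagree with each other more often than the correct ones do, which is why the sketch keeps the instruction and the voting step as separate, independently switchable pieces.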