We investigate how individual demonstration components affect the in-context learning (ICL) performance of large language models (LLMs). Specifically, we examine the impact of ground-truth labels, input distribution, and complementary explanations when each of these is altered or perturbed. We build on previous work, which reports mixed findings on how these elements influence ICL. To probe these questions, we apply explainable NLP (XNLP) methods and use saliency maps of contrastive demonstrations for both qualitative and quantitative analysis. Our findings show that flipping ground-truth labels significantly affects the saliency, and that the effect is more noticeable in larger LLMs. A fine-grained analysis of the input distribution shows that, in a sentiment analysis task, replacing sentiment-indicative terms with neutral ones has a smaller impact than altering ground-truth labels. Finally, we find that the effectiveness of complementary explanations in boosting ICL performance is task-dependent, with limited benefits in sentiment analysis tasks compared to symbolic reasoning tasks. These insights are critical for understanding how LLMs function and for guiding the development of effective demonstrations, which is increasingly relevant given the growing use of LLMs in applications such as ChatGPT. Our research code is publicly available at https://github.com/paihengxu/XICL.
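As a rough illustration of what "contrastive demonstrations" means for the sentiment task, the sketch below builds an original prompt, a label-flipped variant, and a variant whose sentiment-indicative terms are neutralized. It is a minimal sketch under stated assumptions and not the code in the repository above: the `Demo` class, the prompt template, the label names, and the `NEUTRAL` substitution table are all hypothetical, and the saliency computation itself is not shown.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Demo:
    """One in-context demonstration (hypothetical structure)."""
    text: str
    label: str


# Assumed binary label set for a sentiment task.
FLIP = {"positive": "negative", "negative": "positive"}

# Hypothetical mapping from sentiment-indicative terms to neutral ones.
NEUTRAL = {"loved": "watched", "hated": "watched",
           "great": "okay", "terrible": "okay"}


def flip_label(demo: Demo) -> Demo:
    """Return a copy of the demonstration with its label flipped."""
    return replace(demo, label=FLIP[demo.label])


def neutralize(demo: Demo) -> Demo:
    """Return a copy with sentiment-indicative words replaced by neutral ones."""
    words = [NEUTRAL.get(w.lower(), w) for w in demo.text.split()]
    return replace(demo, text=" ".join(words))


def build_prompt(demos: list[Demo], query: str) -> str:
    """Join demonstrations and the unlabeled query into one ICL prompt."""
    blocks = [f"Review: {d.text}\nSentiment: {d.label}" for d in demos]
    blocks.append(f"Review: {query}\nSentiment:")
    return "\n\n".join(blocks)


demos = [Demo("I loved this film", "positive"),
         Demo("The plot was terrible", "negative")]
query = "A great cast"

original = build_prompt(demos, query)
flipped = build_prompt([flip_label(d) for d in demos], query)
neutral = build_prompt([neutralize(d) for d in demos], query)
```

The three prompts differ only in the perturbed component, so saliency maps computed on `original` and on each variant can be compared pairwise.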