We investigate the role of various demonstration components in the in-context learning (ICL) performance of large language models (LLMs). Specifically, we explore the impacts of ground-truth labels, input distribution, and complementary explanations, particularly when these are altered or perturbed. We build on previous work, which offers mixed findings on how these elements influence ICL. To probe these questions, we employ explainable NLP (XNLP) methods and utilize saliency maps of contrastive demonstrations for both qualitative and quantitative analysis. Our findings reveal that flipping ground-truth labels significantly affects the saliency, though this effect is more pronounced in larger LLMs. Our granular analysis of the input distribution reveals that, in a sentiment analysis task, changing sentiment-indicative terms to neutral ones has a less substantial impact than altering ground-truth labels. Finally, we find that the effectiveness of complementary explanations in boosting ICL performance is task-dependent, with limited benefits in sentiment analysis tasks compared to symbolic reasoning tasks. These insights are critical for understanding the functionality of LLMs and guiding the development of effective demonstrations, which is increasingly relevant in light of the growing use of LLMs in applications such as ChatGPT. Our research code is publicly available at https://github.com/paihengxu/XICL.
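To make the saliency-map analysis mentioned above concrete, the following is a minimal sketch of an input-gradient saliency computation over an ICL prompt containing a (possibly label-flipped) demonstration. It assumes a Hugging Face causal LM; the model name, prompt, target label, and the gradient-norm aggregation are illustrative choices, not necessarily the exact configuration used in the paper.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Hypothetical model choice; any causal LM with accessible input embeddings works similarly.
model_name = "gpt2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

# An ICL prompt: one demonstration (here with a flipped label) followed by a query.
prompt = ("Review: The movie was wonderful. Sentiment: negative\n"
          "Review: I loved every minute of it. Sentiment:")
target = " positive"

inputs = tokenizer(prompt, return_tensors="pt")
target_id = tokenizer(target, add_special_tokens=False).input_ids[0]

# Take gradients with respect to the input token embeddings.
embed_layer = model.get_input_embeddings()
embeds = embed_layer(inputs.input_ids).detach().requires_grad_(True)

outputs = model(inputs_embeds=embeds, attention_mask=inputs.attention_mask)
# Log-probability of the target label token at the final position.
log_probs = torch.log_softmax(outputs.logits[0, -1], dim=-1)
log_probs[target_id].backward()

# Per-token saliency: L2 norm of the embedding gradient (one common aggregation choice).
saliency = embeds.grad[0].norm(dim=-1)
for tok, score in zip(tokenizer.convert_ids_to_tokens(inputs.input_ids[0]), saliency):
    print(f"{tok:>15s}  {score.item():.4f}")
```

Contrasting the resulting per-token scores for an original demonstration against a perturbed one (e.g., flipped label or neutralized sentiment terms) is the kind of comparison the analysis above relies on.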