Large language models (LLMs) can perform complex reasoning through in-context learning (ICL) when provided with a few input-output demonstrations (demos), and they become more powerful when the demos include intermediate reasoning steps ("chain of thought (CoT)"). Is it necessary to use multiple demos in ICL? In this paper, we study ICL using fewer demos for each test query on the tasks in~\cite{wei2022chain}. Surprisingly, we do not observe significant degradation when using only one randomly chosen demo. To study this phenomenon, for each test query, we categorize demos into "correct demos", which lead to the correct answer, and "wrong demos", which result in wrong answers. Our analysis reveals an inherent bias in these widely studied datasets: most demos are correct for a majority of test queries, which explains the good performance of using one random demo. Moreover, ICL (with and without CoT) using only one correct demo significantly outperforms the all-demo ICL adopted by most previous works, indicating the weakness of LLMs in finding the correct demo(s) for input queries, a weakness that is difficult to evaluate on the biased datasets. Furthermore, we observe a counterintuitive behavior of multi-demo ICL: its accuracy degrades (improves) when given more correct (wrong) demos. This implies that ICL can be easily misguided by interference among demos and their spurious correlations. Our analyses highlight several fundamental challenges that need to be addressed in LLM training, ICL, and benchmark design.
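The per-query split into correct and wrong demos can be illustrated with a minimal sketch. It is not the authors' code, and it makes several assumptions: each demo is tried alone as a one-demo prompt, a prediction counts as correct only on exact string match with the gold answer, and `predict` is a hypothetical callable standing in for an LLM call. The stub `copy_demo_answer` is a toy placeholder so the example runs without a model.

```python
from typing import Callable, List, Tuple

Demo = Tuple[str, str]  # (question, answer); the answer may include a rationale
Predict = Callable[[List[Demo], str], str]  # hypothetical model interface


def split_demos(query: str, gold: str, demos: List[Demo],
                predict: Predict) -> Tuple[List[int], List[int]]:
    """Return indices of demos that lead to the gold answer and those that do not.

    Assumption: each demo is evaluated alone, as a one-demo prompt.
    """
    correct: List[int] = []
    wrong: List[int] = []
    for i, demo in enumerate(demos):
        prediction = predict([demo], query)
        # Assumption: exact string match decides correctness.
        if prediction == gold:
            correct.append(i)
        else:
            wrong.append(i)
    return correct, wrong


def fraction_correct(query: str, gold: str, demos: List[Demo],
                     predict: Predict) -> float:
    """Share of demos that are 'correct' for this query."""
    correct, _ = split_demos(query, gold, demos, predict)
    return len(correct) / len(demos) if demos else 0.0


def copy_demo_answer(demos: List[Demo], query: str) -> str:
    """Toy stand-in for an LLM: it simply repeats the last demo's answer."""
    return demos[-1][1]
```

Averaging `fraction_correct` over a test set gives a rough measure of how often a single random demo would be enough, which is the kind of dataset bias the abstract describes.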