Large language models (LLMs) can perform complex reasoning through in-context learning (ICL) when given a few input-output demonstrations (demos), and they become more powerful when the demos include intermediate reasoning steps ("chain of thought (CoT)"). Is it necessary to use multiple demos in ICL? In this paper, we study ICL with fewer demos per test query on the tasks in~\cite{wei2022chain}. Surprisingly, we do not observe significant degradation when using only one randomly chosen demo. To study this phenomenon, for each test query we categorize demos into "correct demos", which lead to the correct answer, and "wrong demos", which result in wrong answers. Our analysis reveals an inherent bias in these widely studied datasets: most demos are correct for a majority of test queries, which explains the good performance of a single random demo. Moreover, ICL (with and without CoT) using only one correct demo significantly outperforms the all-demo ICL adopted by most previous works, indicating that LLMs are weak at finding the correct demo(s) for an input query, a weakness that is difficult to evaluate on the biased datasets. Furthermore, we observe a counterintuitive behavior of multi-demo ICL: its accuracy degrades (improves) when given more correct (wrong) demos. This implies that ICL can be easily misguided by interference among demos and by their spurious correlations. Our analyses highlight several fundamental challenges that need to be addressed in LLM training, ICL, and benchmark design.