It Takes One to Tango but More Make Trouble? In-Context Training with Different Number of Demonstrations

Large language models (LLMs) can perform complex reasoning via in-context learning (ICL) when provided with a few input-output demonstrations (demos), and they become more powerful when the demos include intermediate reasoning steps ("chain of thought", CoT). Are multiple demos necessary in ICL? In this paper, we study ICL using fewer demos for each test query on the tasks in~\cite{wei2022chain}. Surprisingly, we do not observe significant degradation when using only one randomly chosen demo. To study this phenomenon, for each test query we categorize demos into "correct demos", which lead to the correct answer, and "wrong demos", which lead to wrong answers. Our analysis reveals an inherent bias in these widely studied datasets: most demos are correct for a majority of test queries, which explains the good performance of using one random demo. Moreover, ICL (with and without CoT) using only one correct demo significantly outperforms the all-demo ICL adopted by most previous works, indicating that LLMs are weak at finding the correct demo(s) for an input query, a weakness that is difficult to evaluate on the biased datasets. Furthermore, we observe counterintuitive behavior in multi-demo ICL: its accuracy degrades (improves) when given more correct (wrong) demos. This implies that ICL can be easily misguided by interference among demos and by their spurious correlations. Our analyses highlight several fundamental challenges that need to be addressed in LLM training, ICL, and benchmark design.
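
The per-query split into "correct" and "wrong" demos can be illustrated with a minimal sketch. This is not the authors' implementation: the prompt template, the exact-match comparison, and the names `build_prompt`, `categorize_demos`, and `predict` are all assumptions made for illustration, and `predict` stands in for whatever LLM call returns a final answer string.

```python
from typing import Callable, List, Tuple

# A demo is a (question, answer) pair; the answer may include a CoT rationale.
Demo = Tuple[str, str]


def build_prompt(demo: Demo, query: str) -> str:
    """Build a one-demo ICL prompt (hypothetical template, not from the paper)."""
    demo_question, demo_answer = demo
    return f"Q: {demo_question}\nA: {demo_answer}\n\nQ: {query}\nA:"


def categorize_demos(
    demos: List[Demo],
    query: str,
    gold_answer: str,
    predict: Callable[[str], str],
) -> Tuple[List[int], List[int]]:
    """Split demo indices into (correct, wrong) for a single test query.

    A demo counts as "correct" for this query if one-demo ICL with that demo
    yields the gold answer, and as "wrong" otherwise. Exact string match after
    stripping whitespace is an assumption; real answer matching may differ.
    """
    correct: List[int] = []
    wrong: List[int] = []
    for index, demo in enumerate(demos):
        prediction = predict(build_prompt(demo, query))
        if prediction.strip() == gold_answer.strip():
            correct.append(index)
        else:
            wrong.append(index)
    return correct, wrong
```

Under these assumptions, the fraction of queries for which `correct` holds most of the demo indices gives a rough measure of the dataset bias described above, and picking any index from `correct` corresponds to the "one correct demo" setting.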
