In-context learning (ICL) is now a common method for teaching large language models (LLMs) new tasks: given labeled examples in the input context, the LLM learns to perform the task without weight updates. Do models guided via ICL infer the underlying structure of the task defined by the context, or do they rely on superficial heuristics that generalize only to identically distributed examples? We address this question using syntactic transformation tasks and a natural language inference (NLI) task that assess sensitivity to syntax, a requirement for robust language understanding. We further investigate whether out-of-distribution generalization can be improved via chain-of-thought prompting, in which the model is provided with a sequence of intermediate computation steps illustrating how the task ought to be performed. In experiments with models from the GPT, PaLM, and Llama 2 families, we find large variance across LLMs. This variance is explained more by the composition of the pre-training corpus and the supervision methods than by model size; in particular, models pre-trained on code generalize better and benefit more from chain-of-thought prompting.
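To make the prompting setup concrete, below is a minimal Python sketch of how ICL prompts for a question-formation transformation task might be constructed, with and without chain-of-thought demonstrations. The demonstration sentences, the reasoning template, and the auxiliary-lookup shortcut are all hypothetical stand-ins for illustration, not the paper's actual prompts or data.

```python
# A sketch of ICL prompt construction for a declarative-to-question
# transformation task. All examples and templates here are hypothetical.

ICL_EXAMPLES = [
    ("the dog has eaten", "has the dog eaten?"),
    ("my walrus can swim", "can my walrus swim?"),
]

# Chain-of-thought variant: each demonstration spells out the intermediate
# step (identify the auxiliary, move it to the front) before the answer.
COT_TEMPLATE = (
    "Input: {src}\n"
    "Reasoning: the main auxiliary is '{aux}'; move it to the front.\n"
    "Output: {tgt}\n"
)

PLAIN_TEMPLATE = "Input: {src}\nOutput: {tgt}\n"


def build_prompt(test_sentence: str, use_cot: bool = True) -> str:
    """Concatenate labeled demonstrations, then append the unlabeled test item."""
    blocks = []
    for src, tgt in ICL_EXAMPLES:
        if use_cot:
            # Toy auxiliary lookup; only valid for the two demonstrations above.
            aux = src.split()[2]
            blocks.append(COT_TEMPLATE.format(src=src, aux=aux, tgt=tgt))
        else:
            blocks.append(PLAIN_TEMPLATE.format(src=src, tgt=tgt))
    blocks.append(f"Input: {test_sentence}\nOutput:")
    return "\n".join(blocks)


if __name__ == "__main__":
    # An out-of-distribution test item with a relative clause: a model relying
    # on a linear heuristic ("front the first auxiliary") produces the wrong
    # question, while a model tracking syntax fronts the main-clause auxiliary.
    print(build_prompt("the dog that can bark has eaten", use_cot=True))
```

The out-of-distribution test item illustrates the paper's core contrast: demonstrations drawn from simple sentences are compatible with both a surface heuristic and the correct hierarchical rule, and only structurally novel inputs distinguish the two.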