Dissecting Chain-of-Thought: Compositionality through In-Context Filtering and Learning

From arXiv. Accepted at NeurIPS 2023. Changes in this version: refined title, restructured content, new out-of-distribution experiments, and code now available.

Chain-of-thought (CoT) is a method that enables language models to handle complex reasoning tasks by decomposing them into simpler steps. Despite its success, the underlying mechanics of CoT are not yet fully understood. To shed light on this, we investigate the impact of CoT on the ability of transformers to in-context learn a simple-to-study yet general family of compositional functions: multi-layer perceptrons (MLPs). In this setting, we find that the success of CoT can be attributed to breaking down in-context learning of a compositional function into two distinct phases: focusing on and filtering the data related to each step of the composition, and in-context learning the single-step composition function. Through both experimental and theoretical evidence, we demonstrate how CoT significantly reduces the sample complexity of in-context learning (ICL) and facilitates the learning of complex functions that non-CoT methods struggle with. Furthermore, we illustrate how transformers can transition from vanilla in-context learning to mastering a compositional function with CoT simply by incorporating additional layers that perform the necessary data filtering for CoT via the attention mechanism. In addition to these test-time benefits, we show that CoT helps accelerate pretraining by learning shortcuts to represent complex functions, and that filtering plays an important role in this process. These findings collectively provide insights into the mechanics of CoT, inviting further investigation of its role in complex reasoning tasks.
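To make the two phases concrete, the following is a minimal sketch of how a CoT prompt for a compositional function could be laid out and then filtered per step. It is not the paper's implementation. It assumes a 2-layer ReLU MLP, y = v · relu(W x), with small illustrative dimensions, and every name in it (`make_mlp`, `build_cot_prompt`, `filter_step`, the `"x"`/`"s"`/`"y"` tags) is hypothetical. In the paper the filtering is carried out by attention layers inside the transformer; here it is written as an explicit function only to show which data each step sees.

```python
import random


def relu(vec):
    """Element-wise ReLU on a list of floats."""
    return [max(0.0, a) for a in vec]


def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * a for w, a in zip(row, x)) for row in W]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def make_mlp(d_in, d_hidden, rng):
    """Sample an illustrative 2-layer MLP: y = v . relu(W x)."""
    W = [[rng.gauss(0, 1) for _ in range(d_in)] for _ in range(d_hidden)]
    v = [rng.gauss(0, 1) for _ in range(d_hidden)]
    return W, v


def build_cot_prompt(W, v, n_examples, d_in, rng):
    """Build a CoT prompt as tagged tokens: input x, hidden state s, output y.

    A non-CoT prompt would contain only the ("x", ...) and ("y", ...) tokens.
    """
    prompt = []
    for _ in range(n_examples):
        x = [rng.gauss(0, 1) for _ in range(d_in)]
        s = relu(matvec(W, x))  # intermediate step exposed by CoT
        y = dot(v, s)           # final output
        prompt += [("x", x), ("s", s), ("y", y)]
    return prompt


def filter_step(prompt, src, dst):
    """Keep only the (input, output) pairs relevant to one composition step."""
    pairs = []
    for i in range(0, len(prompt), 3):  # each example spans three tokens
        example = dict(prompt[i:i + 3])
        pairs.append((example[src], example[dst]))
    return pairs


rng = random.Random(0)
W, v = make_mlp(d_in=4, d_hidden=3, rng=rng)
prompt = build_cot_prompt(W, v, n_examples=5, d_in=4, rng=rng)

# Phase 1 (filtering): split the prompt into one dataset per step.
step1 = filter_step(prompt, "x", "s")  # data for learning x -> relu(W x)
step2 = filter_step(prompt, "s", "y")  # data for learning s -> v . s
# Phase 2 (single-step ICL) would fit each step from its own filtered pairs.
```

After filtering, each step is a single-layer problem: `step2`, for instance, is a plain linear regression from `s` to `y`. This is one way to read the abstract's claim that CoT lowers the sample complexity of learning the composed function in context.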
