Why think step by step? Reasoning emerges from the locality of experience

Humans have a powerful and mysterious capacity to reason. Working through a set of mental steps enables us to make inferences we would not be capable of making directly even though we get no additional data from the world. Similarly, when large language models generate intermediate steps (a chain of thought) before answering a question, they often produce better answers than they would directly. We investigate why and how chain-of-thought reasoning is useful in language models, testing the hypothesis that reasoning is effective when training data consists of overlapping local clusters of variables that influence each other strongly. These training conditions enable the chaining of accurate local inferences to estimate relationships between variables that were not seen together in training. We prove that there will exist a "reasoning gap", where reasoning through intermediate variables reduces bias, for the simple case of an autoregressive density estimator trained on local samples from a chain-structured probabilistic model. We then test our hypothesis experimentally in more complex models, training an autoregressive language model on samples from Bayes nets but only including a subset of variables in each sample. We test language models' ability to match conditional probabilities with and without intermediate reasoning steps, finding that intermediate steps are only helpful when the training data is locally structured with respect to dependencies between variables. The combination of locally structured observations and reasoning is much more data-efficient than training on all variables. Our results illustrate how the effectiveness of reasoning step by step is rooted in the local statistical structure of the training data.
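The chaining idea in the abstract can be made concrete with a minimal sketch. It is not the authors' experimental setup: the three-variable chain X1 -> X2 -> X3, its probability values, and all function names below are hypothetical choices made for illustration. Each training sample reveals only one local pair, (X1, X2) or (X2, X3), so X1 and X3 are never observed together. The sketch also assumes that a direct estimator with no joint observations of X1 and X3 can do no better than the marginal of X3, and compares this with an estimate that chains the two local conditionals through X2.

```python
import random
from collections import Counter

# Hypothetical binary chain X1 -> X2 -> X3; all values are illustrative.
P_X1 = 0.5                          # P(X1 = 1)
P_X2_GIVEN_X1 = {0: 0.1, 1: 0.9}    # P(X2 = 1 | X1)
P_X3_GIVEN_X2 = {0: 0.2, 1: 0.8}    # P(X3 = 1 | X2)


def sample_chain(rng):
    """Draw one full sample (x1, x2, x3) from the chain."""
    x1 = int(rng.random() < P_X1)
    x2 = int(rng.random() < P_X2_GIVEN_X1[x1])
    x3 = int(rng.random() < P_X3_GIVEN_X2[x2])
    return x1, x2, x3


def local_counts(n, seed=0):
    """Collect counts from n samples, each revealing only one adjacent pair."""
    rng = random.Random(seed)
    pair12, pair23 = Counter(), Counter()
    for _ in range(n):
        x1, x2, x3 = sample_chain(rng)
        if rng.random() < 0.5:
            pair12[(x1, x2)] += 1   # X3 is hidden in this sample
        else:
            pair23[(x2, x3)] += 1   # X1 is hidden in this sample
    return pair12, pair23


def conditional(pairs, given, target):
    """Estimate P(second = target | first = given) from pair counts."""
    total = pairs[(given, 0)] + pairs[(given, 1)]
    return pairs[(given, target)] / total


def direct_estimate(pair23):
    """Assumed fallback without reasoning: the marginal P(X3 = 1)."""
    total = sum(pair23.values())
    return (pair23[(0, 1)] + pair23[(1, 1)]) / total


def chained_estimate(pair12, pair23, x1):
    """Estimate P(X3 = 1 | X1 = x1) by summing over the intermediate X2."""
    return sum(
        conditional(pair12, x1, x2) * conditional(pair23, x2, 1)
        for x2 in (0, 1)
    )


def true_conditional(x1):
    """Exact P(X3 = 1 | X1 = x1) under the hypothetical chain."""
    p2 = P_X2_GIVEN_X1[x1]
    return p2 * P_X3_GIVEN_X2[1] + (1 - p2) * P_X3_GIVEN_X2[0]


if __name__ == "__main__":
    pair12, pair23 = local_counts(20000)
    print("true    P(X3=1 | X1=1):", true_conditional(1))
    print("direct  estimate      :", direct_estimate(pair23))
    print("chained estimate      :", chained_estimate(pair12, pair23, 1))
```

Under these assumptions the chained estimate should land close to the true conditional, while the marginal fallback stays near the base rate of X3 regardless of X1. That difference is a toy version of the "reasoning gap" described above.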

