It is well understood that different neural network architectures are suited to different tasks, but is there always a single best architecture for a given task? We compare the expressive power of transformers, RNNs, and transformers with chain-of-thought (CoT) tokens on a simple and natural class of tasks we term Compositional Reasoning Questions (CRQs). This family captures multi-step problems with tree-like compositional structure, such as evaluating Boolean formulas. We prove that, under standard hardness assumptions, \emph{none} of these three architectures can solve CRQs unless some hyperparameter (depth, embedding dimension, and number of CoT tokens, respectively) grows with the size of the input. We then provide constructions that solve CRQs with each architecture. For transformers, our construction uses depth that is logarithmic in the problem size. For RNNs, logarithmic embedding dimension is necessary and sufficient, so long as the inputs are provided in a certain order. For transformers with CoT, our construction uses $n$ CoT tokens for inputs of size $n$. These results show that, while CRQs are inherently hard, language models can overcome this hardness in several different ways. Even for a single class of problems, each architecture has strengths and weaknesses, and none is strictly better than the others.