As Transformer-based language models are trained on ever larger datasets and with ever more parameters, finding more efficient alternatives to the standard Transformer has become increasingly valuable. While many efficient Transformers and Transformer alternatives have been proposed, none comes with theoretical guarantees that it is a suitable replacement for the standard Transformer. This makes it challenging to identify when to use a specific model and which directions to prioritize for further investigation. In this paper, we aim to understand the capabilities and limitations of efficient Transformers, specifically the Sparse Transformer and the Linear Transformer. We focus on their reasoning capability as exhibited by Chain-of-Thought (CoT) prompting and, following previous work, model the underlying reasoning tasks as Dynamic Programming (DP) problems. Our results show that while these models are expressive enough to solve general DP tasks, contrary to expectations, they require a model size that scales with the problem size. Nonetheless, we identify a class of DP problems for which these models can be more efficient than the standard Transformer. We confirm our theoretical results through experiments on representative DP tasks, adding to the understanding of efficient Transformers' practical strengths and weaknesses.