As transformer-based language models are trained on increasingly large datasets and with vast numbers of parameters, finding more efficient alternatives to the standard Transformer has become increasingly important. While many efficient Transformers and Transformer alternatives have been proposed, none provide theoretical guarantees that they are a suitable replacement for the standard Transformer. This makes it challenging to identify when to use a specific model and which directions to prioritize for further investigation. In this paper, we aim to understand the capabilities and limitations of efficient Transformers, specifically the Sparse Transformer and the Linear Transformer. We focus on their reasoning capability as exhibited by Chain-of-Thought (CoT) prompts and follow previous works to model them as Dynamic Programming (DP) problems. Our results show that while these models are expressive enough to solve general DP tasks, contrary to expectations, they require a model size that scales with the problem size. Nonetheless, we identify a class of DP problems for which these models can be more efficient than the standard Transformer. We confirm our theoretical results through experiments on representative DP tasks, adding to the understanding of efficient Transformers' practical strengths and weaknesses.