The Transformer architecture has emerged as a landmark advancement in artificial intelligence, catalyzing the advent of large language models (LLMs). Despite its remarkable capabilities and the substantial progress it has enabled, the Transformer architecture still exhibits intrinsic limitations. One such limitation is its inability to effectively capture formal languages in the Chomsky hierarchy, such as regular languages or deterministic context-free grammars. Drawing inspiration from pushdown automata, which efficiently recognize deterministic context-free languages using stacks, we propose StackTrans to address this limitation in LLMs. Unlike previous approaches that modify the attention computation, StackTrans explicitly incorporates hidden state stacks between Transformer layers, a design that remains compatible with existing frameworks such as flash-attention. Specifically, the stack operations, such as pushing and popping hidden states, are differentiable and can be learned in an end-to-end manner. Our comprehensive evaluation spans benchmarks covering both the Chomsky hierarchy and large-scale natural language tasks. Across these diverse tasks, StackTrans consistently outperforms standard Transformer models and other baselines. We have successfully scaled StackTrans from 360M to 7B parameters. In particular, our from-scratch pretrained model StackTrans-360M outperforms several open-source LLMs with 2-3x more parameters, showcasing its superior efficiency and reasoning capability.
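To make the idea of a differentiable hidden-state stack concrete, below is a minimal sketch of one plausible realization: a soft stack updated by a convex combination of push, pop, and no-op actions, inserted between Transformer layers. The module and parameter names (`DifferentiableStack`, `stack_depth`, the gating scheme) are illustrative assumptions and do not reproduce the paper's actual implementation.

```python
# Minimal sketch of a differentiable hidden-state stack between Transformer
# layers (illustrative assumption, not the paper's implementation).
import torch
import torch.nn as nn
import torch.nn.functional as F


class DifferentiableStack(nn.Module):
    def __init__(self, hidden_dim: int, stack_depth: int = 8):
        super().__init__()
        self.stack_depth = stack_depth
        # Controller maps the hidden state to soft action weights over
        # {push, pop, no-op}; the softmax keeps the update differentiable.
        self.action = nn.Linear(hidden_dim, 3)
        # Gate blending the stack top back into the residual stream.
        self.read_gate = nn.Linear(hidden_dim * 2, hidden_dim)

    def forward(self, h: torch.Tensor, stack: torch.Tensor):
        # h:     (batch, hidden_dim)        hidden state for the current token
        # stack: (batch, depth, hidden_dim) current stack contents (top = index 0)
        probs = F.softmax(self.action(h), dim=-1)  # (batch, 3)
        p_push, p_pop, p_noop = probs.unbind(-1)

        # Candidate stacks for each discrete action.
        pushed = torch.cat([h.unsqueeze(1), stack[:, :-1]], dim=1)
        popped = torch.cat([stack[:, 1:], torch.zeros_like(stack[:, :1])], dim=1)

        # Convex combination of the candidates keeps the stack update
        # end-to-end differentiable.
        new_stack = (p_push[:, None, None] * pushed
                     + p_pop[:, None, None] * popped
                     + p_noop[:, None, None] * stack)

        # Read the (soft) stack top and gate it into the hidden state.
        top = new_stack[:, 0]
        gate = torch.sigmoid(self.read_gate(torch.cat([h, top], dim=-1)))
        return h + gate * top, new_stack


if __name__ == "__main__":
    batch, hidden, depth = 2, 16, 8
    layer = DifferentiableStack(hidden, depth)
    h = torch.randn(batch, hidden)
    stack = torch.zeros(batch, depth, hidden)
    h_out, stack = layer(h, stack)
    print(h_out.shape, stack.shape)  # torch.Size([2, 16]) torch.Size([2, 8, 16])
```

Because the stack update touches only the hidden states passed between layers and leaves the attention computation itself untouched, a design like this can coexist with optimized attention kernels such as flash-attention, consistent with the compatibility claim above.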