The Transformer architecture has emerged as a landmark advancement in artificial intelligence, catalyzing the advent of large language models (LLMs). Despite its remarkable capabilities and the substantial progress it has enabled, the Transformer architecture still exhibits intrinsic limitations. One such limitation is its inability to effectively capture formal languages in the Chomsky hierarchy, such as regular languages or deterministic context-free grammars. Drawing inspiration from pushdown automata, which efficiently recognize deterministic context-free languages using stacks, we propose StackTrans to address this limitation in LLMs. Unlike previous approaches that modify the attention computation, StackTrans explicitly incorporates hidden state stacks between Transformer layers, a design that remains compatible with existing frameworks such as flash-attention. Specifically, the stack operations, such as pushing and popping hidden states, are differentiable and can be learned in an end-to-end manner. Our comprehensive evaluation spans benchmarks covering both the Chomsky hierarchy and large-scale natural language tasks. Across these diverse tasks, StackTrans consistently outperforms standard Transformer models and other baselines. We have successfully scaled StackTrans from 360M to 7B parameters. In particular, our from-scratch pretrained model StackTrans-360M outperforms several open-source LLMs with 2-3x more parameters, showcasing its superior efficiency and reasoning capability.
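To make the idea of a differentiable hidden-state stack concrete, below is a minimal sketch of one plausible realization: a soft stack updated by a convex combination of push, pop, and no-op actions, inserted between Transformer layers. The module and parameter names (`DifferentiableStack`, `stack_depth`, the gating scheme) are illustrative assumptions and do not reproduce the paper's actual implementation.

```python
# Minimal sketch of a differentiable hidden-state stack between Transformer
# layers (illustrative assumption, not the paper's implementation).
import torch
import torch.nn as nn
import torch.nn.functional as F


class DifferentiableStack(nn.Module):
    def __init__(self, hidden_dim: int, stack_depth: int = 8):
        super().__init__()
        self.stack_depth = stack_depth
        # Controller maps the hidden state to soft action weights over
        # {push, pop, no-op}; the softmax keeps the update differentiable.
        self.action = nn.Linear(hidden_dim, 3)
        # Gate blending the stack top back into the residual stream.
        self.read_gate = nn.Linear(hidden_dim * 2, hidden_dim)

    def forward(self, h: torch.Tensor, stack: torch.Tensor):
        # h:     (batch, hidden_dim)        hidden state for the current token
        # stack: (batch, depth, hidden_dim) current stack contents (top = index 0)
        probs = F.softmax(self.action(h), dim=-1)  # (batch, 3)
        p_push, p_pop, p_noop = probs.unbind(-1)

        # Candidate stacks for each discrete action.
        pushed = torch.cat([h.unsqueeze(1), stack[:, :-1]], dim=1)
        popped = torch.cat([stack[:, 1:], torch.zeros_like(stack[:, :1])], dim=1)

        # Convex combination of the candidates keeps the stack update
        # end-to-end differentiable.
        new_stack = (p_push[:, None, None] * pushed
                     + p_pop[:, None, None] * popped
                     + p_noop[:, None, None] * stack)

        # Read the (soft) stack top and gate it into the hidden state.
        top = new_stack[:, 0]
        gate = torch.sigmoid(self.read_gate(torch.cat([h, top], dim=-1)))
        return h + gate * top, new_stack


if __name__ == "__main__":
    batch, hidden, depth = 2, 16, 8
    layer = DifferentiableStack(hidden, depth)
    h = torch.randn(batch, hidden)
    stack = torch.zeros(batch, depth, hidden)
    h_out, stack = layer(h, stack)
    print(h_out.shape, stack.shape)  # torch.Size([2, 16]) torch.Size([2, 8, 16])
```

Because the stack update touches only the hidden states passed between layers and leaves the attention computation itself untouched, a design like this can coexist with optimized attention kernels such as flash-attention, consistent with the compatibility claim above.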