We describe a family of architectures to support transductive inference by allowing memory to grow to a finite but a priori unknown bound while making efficient use of finite resources for inference. Current architectures use such resources to represent data either eidetically over a finite span (the "context" in Transformers), or fading over an infinite span (in State Space Models, or SSMs). Recent hybrid architectures have combined eidetic and fading memory, but with limitations that do not allow the designer or the learning process to seamlessly modulate the two, nor to extend the eidetic memory span. We leverage ideas from Stochastic Realization Theory to develop a class of models called B'MOJO that seamlessly combines eidetic and fading memory within an elementary composable module. The overall architecture can be used to implement models that access short-term eidetic memory "in-context," permanent structural memory "in-weights," fading memory "in-state," and long-term eidetic memory "in-storage" by natively incorporating retrieval from an asynchronously updated memory. We show that Transformers, existing SSMs such as Mamba, and hybrid architectures such as Jamba are special cases of B'MOJO, and we describe a basic implementation, to be open sourced, that can be stacked and scaled efficiently in hardware. We test B'MOJO on transductive inference tasks, such as associative recall, where it outperforms existing SSMs and hybrid models; as a baseline, we test ordinary language modeling, where B'MOJO achieves perplexity comparable to similarly sized Transformers and SSMs at up to 1.4B parameters while being up to 10% faster to train. Finally, we show that B'MOJO's ability to modulate eidetic and fading memory yields better inference on longer sequences, tested up to 32K tokens, fourfold the length of the longest sequences seen during training.
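To make the eidetic/fading distinction concrete, the sketch below is a minimal, illustrative module (not the paper's B'MOJO implementation) that sums two memory paths: a diagonal linear SSM recurrence whose state decays geometrically (fading memory) and exact attention over a small sliding window of recent tokens (short-term eidetic memory). The class name `HybridMemoryBlock`, all dimensions, and the additive combination are assumptions for illustration only.

```python
# Minimal sketch: fading memory (decaying SSM state) + short-term eidetic
# memory (exact attention over a sliding window). Illustrative only; names,
# shapes, and the additive mixing are assumptions, not the B'MOJO design.
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

class HybridMemoryBlock:
    def __init__(self, d_model, d_state, window, seed=0):
        rng = np.random.default_rng(seed)
        self.a = rng.uniform(0.9, 0.999, size=d_state)   # per-channel decay: fading memory
        self.B = rng.normal(0, 0.02, size=(d_state, d_model))
        self.C = rng.normal(0, 0.02, size=(d_model, d_state))
        self.Wq = rng.normal(0, 0.02, size=(d_model, d_model))
        self.Wk = rng.normal(0, 0.02, size=(d_model, d_model))
        self.Wv = rng.normal(0, 0.02, size=(d_model, d_model))
        self.window = window          # eidetic span: tokens recalled exactly
        self.h = np.zeros(d_state)    # fading state ("in-state" memory)
        self.buffer = []              # eidetic buffer of raw recent inputs

    def step(self, x):
        # Fading path: past inputs are compressed into a fixed-size state
        # and decay geometrically; they are never stored exactly.
        self.h = self.a * self.h + self.B @ x
        fading_out = self.C @ self.h

        # Eidetic path: exact attention over the last `window` inputs.
        self.buffer.append(x)
        self.buffer = self.buffer[-self.window:]
        K = np.stack(self.buffer) @ self.Wk.T
        V = np.stack(self.buffer) @ self.Wv.T
        q = self.Wq @ x
        attn = softmax(K @ q / np.sqrt(len(q)))
        eidetic_out = attn @ V

        # Summing the paths lets learning modulate their relative weight:
        # window -> 0 recovers a pure SSM; window -> sequence length, a Transformer.
        return fading_out + eidetic_out
```

Under these assumptions, the two limiting cases of the window size illustrate the abstract's claim that Transformers and SSMs arise as special cases of such a hybrid module; a separately indexed, asynchronously updated store for long-term eidetic memory ("in-storage") would sit outside this per-step recurrence.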