We describe a family of architectures to support transductive inference by allowing memory to grow to a finite but a priori unknown bound while making efficient use of finite resources for inference. Current architectures use such resources to represent data either eidetically over a finite span (the "context" in Transformers), or fading over an infinite span (in State Space Models, or SSMs). Recent hybrid architectures have combined eidetic and fading memory, but with limitations that do not allow the designer or the learning process to seamlessly modulate the two, nor to extend the eidetic memory span. We leverage ideas from Stochastic Realization Theory to develop a class of models called B'MOJO that seamlessly combines eidetic and fading memory within an elementary composable module. The overall architecture can be used to implement models that access short-term eidetic memory "in-context," permanent structural memory "in-weights," fading memory "in-state," and long-term eidetic memory "in-storage" by natively incorporating retrieval from an asynchronously updated memory. We show that Transformers, existing SSMs such as Mamba, and hybrid architectures such as Jamba are special cases of B'MOJO, and we describe a basic implementation, to be open sourced, that can be stacked and scaled efficiently in hardware. We test B'MOJO on transductive inference tasks, such as associative recall, where it outperforms existing SSMs and hybrid models; as a baseline, we test ordinary language modeling, where B'MOJO achieves perplexity comparable to similarly sized Transformers and SSMs at up to 1.4B parameters while being up to 10% faster to train. Finally, we show that B'MOJO's ability to modulate eidetic and fading memory yields better inference on longer sequences, tested up to 32K tokens, fourfold the length of the longest sequences seen during training.
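To make the eidetic/fading distinction concrete, the sketch below is a minimal, illustrative module (not the paper's B'MOJO implementation) that sums two memory paths: a diagonal linear SSM recurrence whose state decays geometrically (fading memory) and exact attention over a small sliding window of recent tokens (short-term eidetic memory). The class name `HybridMemoryBlock`, all dimensions, and the additive combination are assumptions for illustration only.

```python
# Minimal sketch: fading memory (decaying SSM state) + short-term eidetic
# memory (exact attention over a sliding window). Illustrative only; names,
# shapes, and the additive mixing are assumptions, not the B'MOJO design.
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

class HybridMemoryBlock:
    def __init__(self, d_model, d_state, window, seed=0):
        rng = np.random.default_rng(seed)
        self.a = rng.uniform(0.9, 0.999, size=d_state)   # per-channel decay: fading memory
        self.B = rng.normal(0, 0.02, size=(d_state, d_model))
        self.C = rng.normal(0, 0.02, size=(d_model, d_state))
        self.Wq = rng.normal(0, 0.02, size=(d_model, d_model))
        self.Wk = rng.normal(0, 0.02, size=(d_model, d_model))
        self.Wv = rng.normal(0, 0.02, size=(d_model, d_model))
        self.window = window          # eidetic span: tokens recalled exactly
        self.h = np.zeros(d_state)    # fading state ("in-state" memory)
        self.buffer = []              # eidetic buffer of raw recent inputs

    def step(self, x):
        # Fading path: past inputs are compressed into a fixed-size state
        # and decay geometrically; they are never stored exactly.
        self.h = self.a * self.h + self.B @ x
        fading_out = self.C @ self.h

        # Eidetic path: exact attention over the last `window` inputs.
        self.buffer.append(x)
        self.buffer = self.buffer[-self.window:]
        K = np.stack(self.buffer) @ self.Wk.T
        V = np.stack(self.buffer) @ self.Wv.T
        q = self.Wq @ x
        attn = softmax(K @ q / np.sqrt(len(q)))
        eidetic_out = attn @ V

        # Summing the paths lets learning modulate their relative weight:
        # window -> 0 recovers a pure SSM; window -> sequence length, a Transformer.
        return fading_out + eidetic_out
```

Under these assumptions, the two limiting cases of the window size illustrate the abstract's claim that Transformers and SSMs arise as special cases of such a hybrid module; a separately indexed, asynchronously updated store for long-term eidetic memory ("in-storage") would sit outside this per-step recurrence.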