Linear-recurrent sequence models such as Mamba and gated linear attention (GLA) offer efficient alternatives to transformers thanks to their linear complexity and parallelisable training, but they often lack the expressivity and robust state tracking needed for complex reasoning. We address these limitations by reframing sequence modelling through a probabilistic lens, using Bayesian filters as a core primitive. Classical filters such as the Kalman filter provide principled state estimation and uncertainty tracking, but they are typically viewed as inherently sequential. We show that reparameterising the Kalman filter in information form enables its updates to be computed via an associative scan, allowing efficient parallel training. Building on this insight, we introduce the Kalman Linear Attention (KLA) layer, a neural sequence-modelling primitive that performs time-parallel probabilistic inference while maintaining explicit belief-state uncertainty. KLA offers strictly more expressive nonlinear updates and gating than GLA variants while retaining their computational advantages. On language modelling tasks, KLA matches or outperforms modern SSMs and GLAs across representative discrete token-manipulation and state-tracking benchmarks.
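To make the scan-parallel claim concrete, here is a minimal, hypothetical scalar illustration (not the paper's KLA layer): for a linear-Gaussian model the filtered precision obeys a fractional-linear (Möbius) recursion, and Möbius maps compose by 2×2 matrix multiplication, which is associative. All the model parameters (`a`, `q`, `r`) and function names below are assumptions for this sketch.

```python
import numpy as np

# Hypothetical scalar model (illustration only, not the paper's KLA layer):
#   x_t = a * x_{t-1} + w_t,  w_t ~ N(0, q)
#   y_t = x_t + v_t,          v_t ~ N(0, r)
# The filtered precision lam_t = 1 / P_t satisfies the Riccati-style recursion
#   lam_t = lam_{t-1} / (a^2 + q * lam_{t-1}) + 1 / r,
# a Mobius (fractional-linear) map in lam. Mobius maps compose via 2x2 matrix
# products, an associative operation, so every lam_t can be produced by an
# associative scan over per-step matrices.

def precision_step_matrix(a, q, r):
    # Matrix M with lam_t = (M[0,0]*lam + M[0,1]) / (M[1,0]*lam + M[1,1]).
    return np.array([[1.0 + q / r, a * a / r],
                     [q,           a * a]])

def parallel_precisions(lam0, a, q, r, T):
    # Prefix products M, M@M, ... are associative, hence scan-parallelisable;
    # a sequential loop stands in for the parallel scan here.
    M = precision_step_matrix(a, q, r)
    prods, P = [], np.eye(2)
    for _ in range(T):
        P = M @ P
        prods.append(P.copy())
    return np.array([(P[0, 0] * lam0 + P[0, 1]) / (P[1, 0] * lam0 + P[1, 1])
                     for P in prods])

def sequential_precisions(lam0, a, q, r, T):
    # Reference implementation: the plain step-by-step recursion.
    lam, out = lam0, []
    for _ in range(T):
        lam = lam / (a * a + q * lam) + 1.0 / r
        out.append(lam)
    return np.array(out)
```

Because the per-step matrices need not be equal, the same construction covers time-varying (gated) dynamics; the full information-form filter additionally scans an affine recursion for the information vector, which is associative in the same way.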