Modern sequence modeling is dominated by two families: Transformers, whose self-attention can access arbitrary elements of the visible sequence, and structured state-space models, which propagate information through an explicit recurrent state. These mechanisms face different limitations on long contexts: when attention is diffuse, the influence of individual tokens is diluted across the effective support, while recurrent state propagation can lose long-range sensitivity unless information is actively preserved. As a result, both mechanisms face challenges in preserving and selectively retrieving information over long contexts. We propose Sessa, a decoder that places attention inside a recurrent feedback path. This creates many attention-based paths through which past tokens can influence future states, rather than relying on a single attention read or a single recurrent chain. We prove that, under explicit assumptions and matched regimes, Sessa admits power-law memory tails $O(\ell^{-\beta})$ for $0 < \beta < 1$, with slower decay than in the corresponding Transformer and Mamba-style baselines. We further give an explicit construction that achieves this power-law rate. Under the same assumptions, Sessa is the only model class among those considered that realizes flexible selective retrieval, including profiles whose influence does not decay with distance. Consistent with this theoretical advantage, across matched experiments, Sessa achieves the strongest performance on long-context benchmarks while remaining competitive with Transformer and Mamba-style baselines on short-context language modeling.
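To give a feel for what a power-law memory tail $O(\ell^{-\beta})$ with $0 < \beta < 1$ means at long lags, the following minimal sketch tabulates $\ell^{-\beta}$ next to a generic exponential decay. It is only an illustration of the two decay shapes: the function names, the choice $\beta = 0.5$, and the exponential rate $0.01$ are assumptions made here, and the exponential curve is a reference shape, not the rate established for any baseline in the paper.

```python
import math


def power_law_tail(lag: int, beta: float) -> float:
    """Value of lag**(-beta) for lag >= 1 (hypothetical helper for illustration)."""
    return lag ** (-beta)


def exponential_tail(lag: int, rate: float) -> float:
    """Value of exp(-rate * lag), a generic reference decay (not a proven baseline rate)."""
    return math.exp(-rate * lag)


beta = 0.5   # assumed exponent inside the stated range 0 < beta < 1
rate = 0.01  # assumed exponential rate, chosen only to make the crossover visible

for lag in (10, 100, 1_000, 10_000):
    p = power_law_tail(lag, beta)
    e = exponential_tail(lag, rate)
    print(f"lag={lag:>6}  power-law={p:.3e}  exponential={e:.3e}")
```

With these assumed constants, the exponential curve is larger at short lags but falls below the power-law curve as the lag grows, which is the qualitative sense in which a power-law tail retains influence over longer distances.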