We identify and prove a fundamental trade-off governing long-sequence models: no model can simultaneously achieve (i) per-step computation independent of sequence length (Efficiency), (ii) state size independent of sequence length (Compactness), and (iii) the ability to recall a number of historical facts proportional to sequence length (Recall). We formalize this trade-off within an Online Sequence Processor abstraction that unifies Transformers, state space models, linear recurrent networks, and their hybrids. Using the Data Processing Inequality and Fano's Inequality, we prove that any model satisfying Efficiency and Compactness can recall at most O(poly(d)/log V) key-value pairs from a sequence of arbitrary length, where d is the model dimension and V is the vocabulary size. We classify 52 architectures published before March 2026 into the triangle, showing that each achieves at most two of the three properties and that hybrid architectures trace continuous trajectories in the interior. Experiments on synthetic associative recall tasks with five representative architectures validate the theoretical bound: empirical recall capacity lies strictly below the information-theoretic limit, and no architecture escapes the triangle.
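The shape of the recall bound can be illustrated with a short derivation. This is only a sketch under stated assumptions, not the proof given in the paper: we assume the state is the sole channel from the past sequence to the answer, that it fits in $B = \mathrm{poly}(d)$ bits at fixed numerical precision, that the $n$ stored values $X_1, \dots, X_n$ are independent and uniform over a vocabulary of size $V$, and that each value is recalled with error probability at most $\varepsilon$. The symbols $B$, $n$, $\varepsilon$, $X_i$, $S$, and $\hat{X}_i$ are introduced here for illustration, and logarithms are base 2.

```latex
% Assumed Markov chain for each queried key: X_i -> S -> \hat{X}_i,
% where S is the B-bit state and \hat{X}_i is the recalled value.
\begin{align}
  % Fano's Inequality for a uniform value over V symbols:
  H(X_i \mid \hat{X}_i) &\le 1 + \varepsilon \log V, \\
  % hence each recalled value carries at least this much information:
  I(X_i ; \hat{X}_i) &= H(X_i) - H(X_i \mid \hat{X}_i)
                      \ge (1 - \varepsilon)\log V - 1, \\
  % Data Processing Inequality along X_i -> S -> \hat{X}_i:
  I(X_i ; S) &\ge I(X_i ; \hat{X}_i), \\
  % independence of the X_i, and the state holding at most B bits:
  \sum_{i=1}^{n} I(X_i ; S) &\le I(X_1, \dots, X_n ; S) \le H(S) \le B, \\
  % combining the lines above, assuming (1 - \varepsilon)\log V > 1:
  n &\le \frac{B}{(1 - \varepsilon)\log V - 1}
     = O\!\left(\frac{\mathrm{poly}(d)}{\log V}\right).
\end{align}
```

Under these assumptions the bound does not depend on the sequence length, which is why a model with a fixed-size state cannot recall a number of facts that grows in proportion to it.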