Transformer-based language models (LMs) have achieved widespread empirical success, but their theoretical expressive power remains only partially understood. In this work, we analyze a restricted idealization of fixed-precision transformers with strict future masking, soft attention, and no positional encodings. We establish that this class of models is exactly as expressive as a specific fragment of linear temporal logic that contains only a single temporal operator: the past operator. We further connect this fragment to established classes in formal language theory, automata theory, and algebra, yielding a unified framework for understanding transformer expressivity under this idealization. Finally, we present empirical results that align closely with our theory: transformers trained on languages within their characterized expressive capacity generalize reliably across sequence lengths, while they consistently fail to generalize on languages beyond it.
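To make the logic fragment concrete, the following is a minimal, hypothetical sketch (not taken from the paper) of evaluating a formula in a past-only fragment of linear temporal logic over a string, position by position. It assumes the past operator `P φ` holds at position i iff φ holds at some strictly earlier position j < i; the example formula and string are illustrative only.

```python
def past(values):
    """P phi: True at position i iff phi held at some earlier position j < i."""
    out, seen = [], False
    for v in values:
        out.append(seen)       # record whether phi has held strictly before i
        seen = seen or v       # then fold in the current position
    return out

w = "aabba"
is_a = [c == "a" for c in w]
is_b = [c == "b" for c in w]

# Example formula: b AND (P a) -- the current symbol is b and an a occurred earlier.
holds = [b and p for b, p in zip(is_b, past(is_a))]
# holds == [False, False, True, True, False]
```

Each position's truth value depends only on the prefix up to that point, which mirrors the strict future masking of the transformer idealization described above.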