Existing work has analyzed the representational capacity of the transformer architecture by means of formal models of computation. However, the focus so far has been on analyzing the architecture in terms of language \emph{acceptance}. We contend that this is an ill-suited problem in the study of \emph{language models} (LMs), which are definitionally \emph{probability distributions} over strings. In this paper, we focus on the relationship between transformer LMs and $n$-gram LMs, a simple and historically relevant class of language models. We show that transformer LMs using the hard or sparse attention mechanisms can exactly represent any $n$-gram LM, giving us a concrete lower bound on their probabilistic representational capacity. This provides a first step towards understanding the mechanisms that transformer LMs can use to represent probability distributions over strings.
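For concreteness, the defining property of the $n$-gram LMs referred to above is that the next-symbol distribution depends only on the preceding $n-1$ symbols. A minimal statement of this (the symbols $y_t$ and the conditional notation are illustrative, not fixed by the abstract) is
\[
p\left(y_t \mid y_1 \cdots y_{t-1}\right) = p\left(y_t \mid y_{t-n+1} \cdots y_{t-1}\right),
\]
so exactly representing such an LM amounts to the transformer recovering, at every position, the distribution conditioned on the last $n-1$ symbols.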