A substantial body of existing work has analyzed the abilities of the transformer architecture by describing its representational capacity with formal models of computation. However, the focus so far has been on analyzing the architecture in terms of language \emph{acceptance}. We contend that this is an ill-suited problem in the study of \emph{language models} (LMs), which are by definition \emph{probability distributions} over strings. In this paper, we focus on the relationship between transformer LMs and $n$-gram LMs, a simple and historically relevant class of language models. We show that transformer LMs using the hard or sparse attention mechanisms can exactly represent any $n$-gram LM, which gives a concrete lower bound on their probabilistic representational capacity. This provides a first step towards understanding the mechanisms that transformer LMs can use to represent probability distributions over strings.
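For concreteness, the standard $n$-gram assumption can be written out as follows. This is a sketch of the textbook definition and not notation taken from the text above: the symbols $y_t$ (the symbol at position $t$) and $T$ (the string length) are assumptions introduced here, and end-of-string handling is omitted. An $n$-gram LM conditions each symbol on only the $n-1$ symbols that precede it:
\begin{equation}
    p(y_t \mid y_1 \cdots y_{t-1}) = p(y_t \mid y_{t-n+1} \cdots y_{t-1}), \qquad
    p(y_1 \cdots y_T) = \prod_{t=1}^{T} p(y_t \mid y_{t-n+1} \cdots y_{t-1}).
\end{equation}
Under this definition, exactly representing an $n$-gram LM amounts to recovering the previous $n-1$ symbols at every position and mapping them to the corresponding conditional distribution over the next symbol.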