Most expressivity results for transformers treat them as language recognizers, which accept or reject strings, rather than as they are used in practice: as language models, which generate strings autoregressively and probabilistically. We characterize the probability distributions that transformer language models can express. We show that making transformer language recognizers autoregressive can sometimes increase their expressivity, and that making them probabilistic can break equivalences that hold in the non-probabilistic case. Our overall contribution is to tease apart what functions transformers are capable of expressing in their most common use case: as language models.