Most expressivity results for transformers treat them as language recognizers (which accept or reject strings), rather than as they are used in practice: as language models (which generate strings autoregressively and probabilistically). Here, we characterize the probability distributions that transformer language models can express. We show that making transformer language recognizers autoregressive can sometimes increase their expressivity, and that making them probabilistic can break equivalences that hold in the non-probabilistic case. Our overall contribution is to tease apart what functions transformers are capable of expressing in their most common use case as language models.
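To make the recognizer/language-model distinction concrete, the following is a minimal illustrative sketch (not from the paper): a recognizer maps a whole string to accept/reject, while an autoregressive language model assigns the string a probability via the chain-rule factorization over next-symbol distributions. The function `next_token_probs`, the uniform toy distribution, and the even-number-of-a's recognizer are all hypothetical stand-ins, not anything defined in the work itself.

```python
import math

EOS = "<eos>"  # end-of-string symbol used by the language-model view

def next_token_probs(prefix):
    # Hypothetical stand-in for a trained transformer: given a prefix,
    # return a distribution over the next symbol (including EOS).
    # Here it is simply uniform over {a, b, EOS}.
    return {"a": 1 / 3, "b": 1 / 3, EOS: 1 / 3}

def recognizer_accepts(string):
    # Recognizer view: the whole string maps to a single accept/reject
    # decision. Toy example: accept strings with an even number of 'a's.
    return string.count("a") % 2 == 0

def lm_log_probability(string):
    # Language-model view: autoregressive factorization
    # log p(w) = sum_t log p(w_t | w_<t) + log p(EOS | w).
    log_p = 0.0
    for t, symbol in enumerate(string):
        log_p += math.log(next_token_probs(string[:t])[symbol])
    log_p += math.log(next_token_probs(string)[EOS])
    return log_p

print(recognizer_accepts("abab"))   # a language: True or False
print(lm_log_probability("abab"))   # a distribution: a log-probability
```

The same underlying network can be read either way; the paper's question is which sets of languages, and which probability distributions over strings, each reading can express.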