Autoregressive language models have demonstrated a remarkable ability to extract latent structure from text. The embeddings from large language models have been shown to capture aspects of the syntax and semantics of language. But what {\em should} embeddings represent? We connect the autoregressive prediction objective to the idea of constructing predictive sufficient statistics that summarize the information contained in a sequence of observations, and use this connection to identify three settings where the optimal content of embeddings can be characterized: independent and identically distributed data, where the embedding should capture the sufficient statistics of the data; latent state models, where the embedding should encode the posterior distribution over states given the data; and discrete hypothesis spaces, where the embedding should reflect the posterior distribution over hypotheses given the data. We then conduct empirical probing studies showing that transformers encode these three kinds of latent generating distributions, and that in these settings they perform well on out-of-distribution cases and without relying on token memorization.
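
As a minimal illustration of a predictive sufficient statistic in the independent and identically distributed setting, consider a Beta-Bernoulli model. The model and the symbols $\theta$, $a$, $b$, and $s_n$ are assumptions introduced here for illustration and are not taken from the experiments. Suppose $x_1, \dots, x_n \in \{0, 1\}$ are drawn independently from $\mathrm{Bernoulli}(\theta)$ with prior $\theta \sim \mathrm{Beta}(a, b)$. The next-observation predictive distribution is then
\[
  p(x_{n+1} = 1 \mid x_1, \dots, x_n) = \frac{a + s_n}{a + b + n},
  \qquad s_n = \sum_{i=1}^{n} x_i .
\]
The prediction depends on the sequence only through the pair $(s_n, n)$, so under this assumed model an embedding that is optimal for next-token prediction needs to retain only that pair, and not the order or identity of the individual observations.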