We study the probabilistic modeling performed by autoregressive large language models through the lens of time directionality. Empirically, we find that such models exhibit a time asymmetry in their ability to model natural language: the average log-perplexity differs between predicting the next token and predicting the previous one. This difference is both subtle and highly consistent across settings (language, model size, training time, ...). Theoretically, this is surprising: from an information-theoretic point of view, there should be no such difference. We provide a theoretical framework to explain how such an asymmetry can arise from sparsity and computational complexity considerations, and outline a number of perspectives opened by our results.
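The information-theoretic claim can be made concrete with a short derivation. The sketch below is illustrative only: the notation ($P$ for the true distribution over token sequences $x_1, \dots, x_n$, and $H$ for Shannon entropy) is assumed here and is not taken from the abstract.

```latex
% Assumed notation: P is the true distribution over sequences x_1, ..., x_n.
% The chain rule factorizes the same joint probability in either direction:
\begin{align}
  P(x_1, \dots, x_n)
    &= \prod_{i=1}^{n} P(x_i \mid x_1, \dots, x_{i-1})
       && \text{(forward: predict the next token)} \\
    &= \prod_{i=1}^{n} P(x_i \mid x_{i+1}, \dots, x_n)
       && \text{(backward: predict the previous token)}
\end{align}
% Taking -log and the expectation under P gives the same joint entropy:
\begin{equation}
  \sum_{i=1}^{n} H(X_i \mid X_1, \dots, X_{i-1})
  \;=\; H(X_1, \dots, X_n)
  \;=\; \sum_{i=1}^{n} H(X_i \mid X_{i+1}, \dots, X_n)
\end{equation}
```

Under these assumptions, a model that learned the true conditionals exactly would reach the same total log-perplexity in both directions, so a measured gap must come from how well each direction can be learned rather than from the entropy of the data itself.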