Forget BIT, It is All about TOKEN: Towards Semantic Information Theory for LLMs

Large language models (LLMs) have demonstrated remarkable capabilities in numerous real-world applications. Most research on them is experimental; it is progressing rapidly but demands substantial computational power, data, and other resources. How to open the black box of LLMs from a theoretical standpoint has therefore become a critical challenge. This paper takes rate-distortion theory, directed information, and Granger causality as its starting point to investigate the information-theoretic principles behind LLMs, leading to a semantic information theory for LLMs in which the fundamental unit is the token rather than the bit, which lacks any semantic meaning. By defining the probabilistic model of LLMs, we discuss structure-agnostic information-theoretic measures, such as the directed rate-distortion function in pre-training, the directed rate-reward function in post-training, and the semantic information flow in the inference phase. This paper also examines the theory of token-level semantic embedding and the information-theoretically optimal vectorization method. We then propose a general definition of autoregressive LLMs, from which the Transformer architecture and its performance measures, such as the ELBO, generalization error bound, memory capacity, and semantic information measures, can be derived theoretically. Other architectures, such as Mamba/Mamba2 and LLaDA, are also discussed within our framework. This paper thus provides a theoretical framework for understanding LLMs from the perspective of semantic information theory and offers the theoretical tools needed for further in-depth research.
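
For orientation, two of the classical quantities named in the abstract have the following textbook forms. These are the standard definitions (Shannon's rate-distortion function and Massey's directed information), given only as background. The symbols $X$, $\hat{X}$, $d$, $D$, $X^n$, and $Y^n$ are notation assumed for this sketch, and the paper's own directed rate-distortion and rate-reward functions are variants that are not reproduced here.

\[
R(D) \;=\; \min_{p(\hat{x}\mid x)\,:\; \mathbb{E}[d(X,\hat{X})] \le D} I(X;\hat{X})
\]

\[
I(X^n \to Y^n) \;=\; \sum_{t=1}^{n} I\!\left(X^{t};\, Y_t \,\middle|\, Y^{t-1}\right)
\]

Here $X^{t} = (X_1, \dots, X_t)$ and $Y^{t-1} = (Y_1, \dots, Y_{t-1})$. Unlike mutual information, directed information is in general not symmetric in its two arguments, which is why it is commonly used to describe causal dependence between sequences.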
