Despite the empirical success of LLMs across diverse real-world applications, the prevailing research paradigm remains largely heuristic and experiment-driven, and it depends on enormous computational resources and massive datasets. A rigorous theoretical account of LLMs, their "first principles", remains elusive. To open this black box, this work develops a *semantic information theory* that draws on statistical physics, continuous signal processing, and classical information theory. The central premise of the framework is a shift in the basic unit of analysis: from the classical *BIT*, a microscopic unit that carries no semantic content, to the macroscopic *TOKEN*, treated as the irreducible carrier of meaning and reasoning. The resulting unified framework explains the generative mechanics and emergent causal capabilities of LLMs and provides a mathematical foundation to guide future theoretical work and next-generation architectures.