We formalise recursive self-training in Large Language Models (LLMs) and Generative AI as a discrete-time dynamical system and prove that, as training data become increasingly self-generated ($\alpha_t \to 0$), the system inevitably undergoes degenerative dynamics. We derive two fundamental failure modes: (1) Entropy Decay, where finite sampling effects cause a monotonic loss of distributional diversity (mode collapse), and (2) Variance Amplification, where the loss of external grounding causes the model's representation of truth to drift as a random walk, bounded only by the support diameter. We show these behaviours are not contingent on architecture but are consequences of distributional learning on finite samples. We further argue that Reinforcement Learning with imperfect verifiers suffers a similar semantic collapse. To overcome these limits, we propose a path involving symbolic regression and program synthesis guided by Algorithmic Probability. The Coding Theorem Method (CTM) enables the identification of generative mechanisms rather than mere correlations, escaping the data-processing inequality that binds standard statistical learning. We conclude that while purely distributional learning leads to model collapse, hybrid neurosymbolic approaches offer a coherent framework for sustained self-improvement.
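Both failure modes can be illustrated with a minimal simulation, assuming the simplest instantiation of the abstract's model: a categorical distribution that is refit, each generation, to the empirical frequencies of a finite sample drawn from itself (a Wright-Fisher-style drift process). The support size `K`, sample size `n`, and horizon `T` below are illustrative choices, not parameters from the paper. Entropy trends downward until a single mode absorbs the process, while the distribution's mean wanders as a random walk confined to the support.

```python
import numpy as np

def self_training_step(p, n, rng):
    """One generation: draw n samples from p, refit p as empirical frequencies."""
    counts = rng.multinomial(n, p)
    return counts / n

def entropy(p):
    """Shannon entropy in nats, ignoring zero-probability modes."""
    q = p[p > 0]
    return float(-(q * np.log(q)).sum())

rng = np.random.default_rng(0)
K, n, T = 10, 30, 1000            # support size, samples per generation, generations
support = np.arange(K)            # numeric support; diameter = K - 1
p = np.full(K, 1.0 / K)           # start uniform: maximal entropy

entropies, means = [entropy(p)], [float(support @ p)]
for _ in range(T):
    p = self_training_step(p, n, rng)
    entropies.append(entropy(p))
    means.append(float(support @ p))

print(f"entropy: {entropies[0]:.3f} -> {entropies[-1]:.3f}")
print(f"surviving modes: {(p > 0).sum()} of {K}")
print(f"mean drift range: [{min(means):.2f}, {max(means):.2f}] "
      f"within support [0, {K - 1}]")
```

The run typically ends with a single surviving mode (entropy near zero), and the trajectory of the mean illustrates the random walk bounded by the support diameter: no external data ever re-anchors it.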