Rank Collapse, Fixed Points, and the Renormalization Group Structure of MLP Residual Networks

The analogy between deep neural network forward passes and renormalization group (RG) flows has been noted repeatedly in the literature, but existing treatments remain qualitative: depth is described as a coarse-graining scale, attention is likened to a partition function, and representations are said to flow toward fixed points. No existing work has defined a measurable RG order parameter, tested it under controlled variation of the input distribution, or made quantitative predictions that are then verified empirically. We study the simplest architecture for which the analogy is tractable: a pure MLP residual stack trained on masked token prediction over synthetic Markov chain sequences with known spectral properties. We report three findings. (i) After training, the effective rank of the residual stream decreases monotonically with depth, consistent with irrelevant degrees of freedom being progressively integrated out. (ii) This rank collapse is selective: it occurs for chains with a short correlation length (approximately 1) but is absent for chains with a long correlation length (approximately 7), with effective rank measured at the position level to control for mean-pooling artifacts. The network preserves exactly the degrees of freedom relevant to the prediction task, which is the content of the RG relevance criterion. (iii) Inter-layer kernel drift is concentrated at one or two specific layer transitions, while the remainder of the network sits near a fixed point, consistent with a discrete fixed-point plateau. Together, these findings constitute the first quantitative, position-level evidence that MLP residual networks implement a selective coarse-graining procedure governed by the spectral structure of the input distribution.
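The three quantities named in the abstract can be illustrated with a minimal NumPy sketch. The abstract does not specify the estimators, so everything here is an assumption made for illustration: effective rank is taken as the exponential of the Shannon entropy of the normalized singular values, correlation length as -1/ln|λ₂| of the transition matrix, and kernel drift as one minus linear CKA between consecutive layers. The function names and the two-state chain construction are hypothetical and are not the authors' code.

```python
import numpy as np


def effective_rank(X, eps=1e-12):
    """Entropy-based effective rank of activations X (rows = positions, cols = features).

    Assumed definition: exp of the Shannon entropy of the normalized singular values.
    """
    X = X - X.mean(axis=0, keepdims=True)      # center each feature
    s = np.linalg.svd(X, compute_uv=False)     # singular values only
    p = s / (s.sum() + eps)                    # normalize to a distribution
    p = p[p > eps]                             # drop numerically zero modes
    return float(np.exp(-(p * np.log(p)).sum()))


def two_state_chain(xi):
    """Hypothetical symmetric two-state chain whose second eigenvalue is exp(-1/xi)."""
    a = (1.0 - np.exp(-1.0 / xi)) / 2.0        # flip probability
    return np.array([[1.0 - a, a], [a, 1.0 - a]])


def correlation_length(P):
    """Assumed definition: xi = -1 / ln|lambda_2| for a row-stochastic matrix P."""
    lam = np.sort(np.abs(np.linalg.eigvals(P)))[::-1]
    return float(-1.0 / np.log(lam[1]))


def linear_cka(X, Y):
    """Linear centered kernel alignment between two activation matrices."""
    X = X - X.mean(axis=0, keepdims=True)
    Y = Y - Y.mean(axis=0, keepdims=True)
    num = np.linalg.norm(Y.T @ X, "fro") ** 2
    den = np.linalg.norm(X.T @ X, "fro") * np.linalg.norm(Y.T @ Y, "fro")
    return float(num / den)


def kernel_drift(X_prev, X_next):
    """Assumed drift measure: 1 - linear CKA between consecutive layers."""
    return 1.0 - linear_cka(X_prev, X_next)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    acts = rng.normal(size=(2000, 8))          # stand-in for position-level activations
    print("effective rank:", effective_rank(acts))
    print("xi (short chain):", correlation_length(two_state_chain(1.0)))
    print("xi (long chain):", correlation_length(two_state_chain(7.0)))
    print("drift (identical layers):", kernel_drift(acts, acts))
```

Under these assumptions, a depth profile would be obtained by calling `effective_rank` on the position-level activations of each layer, and a fixed-point plateau would appear as `kernel_drift` values near zero for most consecutive layer pairs.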
