This work proposes neural training as a \emph{process tensor}: a multi-time map that takes a sequence of controllable instruments (batch choices, augmentations, optimizer micro-steps) and returns an observable of the trained model. Building on this operational lens, we introduce a simple, model-agnostic witness of training memory based on \emph{back-flow of distinguishability}. In a controlled two-step protocol, we compare outcome distributions after one intervention versus two; the increase $\Delta_{\mathrm{BF}} = D_2 - D_1 > 0$ (with $D\in\{\mathrm{TV}, \mathrm{JS}, \mathrm{H}\}$ measured on softmax predictions over a fixed probe set) certifies non-Markovianity. We observe consistent positive back-flow with tight bootstrap confidence intervals, amplification under higher momentum, larger batch overlap, and more micro-steps, and collapse under a \emph{causal break} (resetting optimizer state), directly attributing the effect to optimizer/data-state memory. The witness is robust across TV/JS/Hellinger, inexpensive to compute, and requires no architectural changes. We position this as a \emph{measurement} contribution: a principled diagnostic and empirical evidence that practical SGD deviates from the Markov idealization. An exploratory case study illustrates how the micro-level signal can inform curriculum orderings. ``Data order matters'' thus becomes a testable operator with confidence bounds, and our framework offers a common stage to compare optimizers, curricula, and schedules through their induced training memory.
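As a minimal sketch of how such a witness could be computed, the snippet below evaluates the three distances (TV, JS, Hellinger) on softmax prediction vectors over a probe set and forms the back-flow difference $\Delta_{\mathrm{BF}} = D_2 - D_1$. The function names and the shape of the API are illustrative assumptions, not the paper's actual code; `p_ref`, `p_one`, and `p_two` stand for averaged probe-set softmax outputs after zero, one, and two interventions, respectively.

```python
import numpy as np

def tv(p, q):
    # Total variation distance between two discrete distributions.
    return 0.5 * np.abs(p - q).sum()

def js(p, q):
    # Jensen-Shannon divergence (base 2), bounded in [0, 1].
    m = 0.5 * (p + q)
    def kl(a, b):
        mask = a > 0
        return np.sum(a[mask] * np.log2(a[mask] / b[mask]))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def hellinger(p, q):
    # Hellinger distance between two discrete distributions.
    return np.sqrt(0.5) * np.linalg.norm(np.sqrt(p) - np.sqrt(q))

def backflow_witness(p_ref, p_one, p_two, dist=tv):
    # D1: distinguishability from the reference after one intervention;
    # D2: after two. A positive difference witnesses non-Markovianity.
    d1 = dist(p_ref, p_one)
    d2 = dist(p_ref, p_two)
    return d2 - d1
```

In practice one would bootstrap over probe examples to attach confidence intervals to the returned difference, as the abstract describes.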