Statistical learning under distributional drift remains insufficiently characterized: when each observation alters the data-generating law, classical generalization bounds can collapse. We introduce a new statistical primitive, the reproducibility budget $C_T$, which quantifies a system's finite capacity for statistical reproducibility: the extent to which its sampling process can remain governed by a consistent underlying distribution in the presence of both exogenous change and endogenous feedback. Formally, $C_T$ is defined as the cumulative Fisher-Rao path length of the coupled learner-environment evolution, measuring the total distributional motion accumulated during learning. From this construct we derive a drift-feedback generalization bound of order $O(T^{-1/2} + C_T/T)$, and we prove a matching lower bound showing that this rate is minimax-optimal. Consequently, the results establish a reproducibility speed limit: no algorithm can achieve smaller worst-case generalization error than that imposed by the average Fisher-Rao drift rate $C_T/T$ of the data-generating process. The framework situates exogenous drift, adaptive data analysis, and performative prediction within a common geometric structure, with $C_T$ emerging as the intrinsic quantity measuring distributional motion across these settings.
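To make the two central quantities concrete, the following is a minimal numerical sketch, not the paper's construction: it computes the cumulative Fisher-Rao path length $C_T$ for a hypothetical drifting one-dimensional Gaussian family $N(\mu_t, \sigma^2)$ with fixed $\sigma$ (where the path length of a mean-only drift is $\sum_t |\mu_{t+1}-\mu_t|/\sigma$), and then evaluates the rate $T^{-1/2} + C_T/T$ with constants omitted. All function names and the linear-drift scenario are illustrative assumptions.

```python
def fisher_rao_step(mu0: float, mu1: float, sigma: float = 1.0) -> float:
    # Fisher-Rao path length of a mean-only move for a 1-D Gaussian with
    # fixed sigma: the Fisher information for mu is 1/sigma^2, so the
    # length of the segment from N(mu0, sigma^2) to N(mu1, sigma^2)
    # along constant sigma is |mu1 - mu0| / sigma.
    return abs(mu1 - mu0) / sigma

def reproducibility_budget(mus: list[float], sigma: float = 1.0) -> float:
    # C_T: total distributional motion accumulated over the trajectory,
    # i.e. the sum of per-step Fisher-Rao path lengths.
    return sum(fisher_rao_step(a, b, sigma) for a, b in zip(mus, mus[1:]))

def bound_rate(T: int, C_T: float) -> float:
    # The drift-feedback generalization rate O(T^{-1/2} + C_T / T),
    # with the hidden constants set to 1 for illustration.
    return T ** -0.5 + C_T / T

# Hypothetical scenario: the mean drifts linearly by delta per round.
T = 10_000
delta = 0.001
mus = [t * delta for t in range(T)]

C_T = reproducibility_budget(mus)   # (T - 1) * delta for linear drift
rate = bound_rate(T, C_T)
```

In this toy scenario the average drift rate $C_T/T \approx \delta$ stays below the statistical term $T^{-1/2}$, so the classical rate dominates; a faster-drifting sequence would instead be limited by the $C_T/T$ term, which is the "reproducibility speed limit" the abstract describes.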