The two clocks and the innovation window: When and how generative models learn rules

From arXiv; 48 pages, 28 figures. Earlier versions were presented as an oral presentation at the NeurIPS 2025 SPIGM workshop: https://openreview.net/forum?id=LjqX8OhPPi

Generative models trained on finite data face a fundamental tension: their score-matching or next-token objective converges to the empirical training distribution rather than the population distribution we seek to learn. Using rule-valid synthetic tasks, we trace this tension across two training timescales: $τ_{\mathrm{rule}}$, the step at which generations first become rule-valid, and $τ_{\mathrm{mem}}$, the step at which models begin reproducing training samples. Focusing on parity and extending to other binary rules and combinatorial puzzles, we characterize how these two clocks, $τ_{\mathrm{rule}}$ and $τ_{\mathrm{mem}}$, depend on key aspects of the learning setup. Specifically, we show that $τ_{\mathrm{rule}}$ increases with rule complexity and decreases with model capacity, while $τ_{\mathrm{mem}}$ is approximately invariant to the rule and scales nearly linearly with dataset size $N$. We define the \emph{innovation window} as the interval $[τ_{\mathrm{rule}}, τ_{\mathrm{mem}}]$. This window widens with increasing $N$ and narrows with rule complexity, and may vanish entirely when $τ_{\mathrm{rule}} \geq τ_{\mathrm{mem}}$. The same two-clock structure arises in both diffusion (DiT) and autoregressive (GPT) models, with architecture-dependent offsets. Dissecting the learned score of DiT models reveals a corresponding evolution of the optimization landscapes, where rule-valid samples' basins expand substantially around $τ_{\mathrm{rule}}$, while training samples' basins begin to dominate around $τ_{\mathrm{mem}}$. Together, these results yield a unified and predictive account of when and how generative models exhibit genuine innovation.
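The definitions of the two clocks and the innovation window can be made concrete with a small sketch. The code below is illustrative only and is not the paper's measurement protocol: the function names, the logged fractions, and the thresholds (`rule_threshold=0.9`, `mem_threshold=0.1`) are all assumptions chosen for the example, since the abstract does not state how "first become rule-valid" or "begin reproducing training samples" are operationalized.

```python
from typing import Callable, Optional, Sequence, Tuple

Sample = Sequence[int]


def parity_valid(bits: Sample) -> bool:
    """Example rule (assumed form): valid when the number of ones is even."""
    return sum(bits) % 2 == 0


def valid_fraction(samples: Sequence[Sample], rule: Callable[[Sample], bool]) -> float:
    """Fraction of generated samples that satisfy the rule."""
    return sum(1 for s in samples if rule(s)) / len(samples)


def memorized_fraction(samples: Sequence[Sample], train: Sequence[Sample]) -> float:
    """Fraction of generated samples that exactly match a training sample."""
    train_set = {tuple(t) for t in train}
    return sum(1 for s in samples if tuple(s) in train_set) / len(samples)


def first_crossing(steps: Sequence[int], values: Sequence[float],
                   threshold: float) -> Optional[int]:
    """First logged step whose value reaches the threshold, or None."""
    for step, value in zip(steps, values):
        if value >= threshold:
            return step
    return None


def innovation_window(
    steps: Sequence[int],
    valid_frac: Sequence[float],
    mem_frac: Sequence[float],
    rule_threshold: float = 0.9,  # hypothetical threshold, not from the paper
    mem_threshold: float = 0.1,   # hypothetical threshold, not from the paper
) -> Tuple[Optional[int], Optional[int], Optional[Tuple[int, int]]]:
    """Return (tau_rule, tau_mem, window).

    window is None when either clock was never observed in the log,
    or when tau_rule >= tau_mem (the window has vanished).
    """
    tau_rule = first_crossing(steps, valid_frac, rule_threshold)
    tau_mem = first_crossing(steps, mem_frac, mem_threshold)
    if tau_rule is None or tau_mem is None or tau_rule >= tau_mem:
        return tau_rule, tau_mem, None
    return tau_rule, tau_mem, (tau_rule, tau_mem)


if __name__ == "__main__":
    # Made-up training log, for illustration only.
    steps = [100, 200, 400, 800, 1600]
    valid = [0.5, 0.7, 0.95, 0.99, 1.0]
    mem = [0.0, 0.0, 0.01, 0.05, 0.4]
    print(innovation_window(steps, valid, mem))
```

Under these assumed thresholds, a harder rule would push the first crossing of `valid_frac` later and a larger training set would push the first crossing of `mem_frac` later, which is how the window narrows or widens in this toy formulation.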
