While the data scaling laws of large language models (LLMs) have been widely examined in the one-pass regime with massive corpora, their form under limited data and repeated epochs remains largely unexplored. This paper presents a theoretical analysis of how a common workaround, training for multiple epochs on the same dataset, reshapes the data scaling laws in linear regression. Concretely, we ask: to match the performance of training on a dataset of size $N$ for $K$ epochs, how much larger must the dataset be if the model is trained for only one pass? We quantify this with the \textit{effective reuse rate} of the data, $E(K, N)$, defined as the multiplicative factor by which the dataset must grow under one-pass training to achieve the same test loss as $K$-epoch training. Our analysis precisely characterizes the scaling behavior of $E(K, N)$ for SGD on linear regression under either strong convexity or Zipf-distributed data: (1) when $K$ is small, we prove that $E(K, N) \approx K$, indicating that every new epoch yields a linear gain; (2) as $K$ increases, $E(K, N)$ plateaus at a problem-dependent value that grows with $N$ ($\Theta(\log N)$ in the strongly convex case), implying that larger datasets can be repeated more times before the marginal benefit vanishes. These theoretical findings expose a factor neglected in a recent empirical study (Muennighoff et al., 2023), which claimed that training LLMs for up to $4$ epochs results in negligible loss differences compared to using fresh data at each step, \textit{i.e.}, $E(K, N) \approx K$ for $K \le 4$ in our notation. Supported by further empirical validation with LLMs, our results reveal that the maximum $K$ for which $E(K, N) \approx K$ in fact depends on the data size and distribution, and they underscore the need to explicitly model both factors in future studies of scaling laws with data reuse.
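The operational definition of $E(K, N)$ can be illustrated with a toy simulation: run SGD for $K$ epochs on $N$ samples of a linear regression problem, then search for the smallest one-pass dataset that reaches the same test loss. This is a minimal sketch under assumed settings (isotropic Gaussian inputs, dimension, learning rate, noise level, and an integer grid search are all illustrative choices, not the paper's experimental protocol):

```python
import numpy as np

# Illustrative toy problem; d, lr, and noise are hypothetical constants.
rng = np.random.default_rng(0)
d = 20
w_star = rng.normal(size=d)

def sgd_loss(n_samples, epochs, lr=0.01, noise=0.1):
    """Multi-epoch SGD on a fresh dataset; returns the population test loss,
    which for isotropic Gaussian inputs equals ||w - w_star||^2."""
    X = rng.normal(size=(n_samples, d))
    y = X @ w_star + noise * rng.normal(size=n_samples)
    w = np.zeros(d)
    for _ in range(epochs):
        for i in rng.permutation(n_samples):  # reshuffle each epoch
            w -= lr * (X[i] @ w - y[i]) * X[i]
    return float(np.sum((w - w_star) ** 2))

def effective_reuse_rate(N, K, max_mult=20):
    """Coarse estimate of E(K, N): smallest integer multiplier m such that
    one pass over m*N fresh samples matches the K-epoch loss on N samples."""
    target = sgd_loss(N, K)
    for m in range(1, max_mult + 1):
        if sgd_loss(m * N, 1) <= target:
            return m
    return None  # plateau not reached within the search grid
```

In this toy setting, small $K$ should give an estimated multiplier close to $K$ (each extra epoch is roughly worth a fresh copy of the data), while large $K$ should drive the estimate toward a plateau, mirroring the two regimes described above.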