We introduce a model for neural scaling laws under sparse activations. In this model, test loss is often dominated by rare coordinates that are never observed in the training input. This mechanism induces a novel bottleneck that is absent from dense models. We derive the asymptotic population loss in both the underparameterized and overparameterized regimes and show that it exhibits a double-descent peak near the interpolation threshold, where the number of parameters is just sufficient to fit the training data. The resulting loss curve is governed by two distinct scaling exponents, one for each regime, with a gap determined by the degree of sparsity. We also derive a compute-optimal frontier that, under a fixed compute budget, favors increasing dataset size over model capacity. In addition, we analyze gradient-descent dynamics and identify a scaling law for the probability that fixed-step gradient descent becomes unstable. Finally, we show that the sparsity-induced effect persists under nonlinear activations.