When training large-scale models, performance typically improves with the number of parameters and the dataset size according to a slow power law. A fundamental theoretical and practical question is whether comparable performance can be achieved with significantly smaller models and substantially less data. In this work, we provide a positive and constructive answer. We prove that a generic permutation-invariant function of $d$ objects can be asymptotically compressed into a function of $\operatorname{polylog} d$ objects with vanishing error, and that this compression rate is optimal. This theorem has two key implications: (Ia) a large neural network can be compressed to polylogarithmic width while preserving its learning dynamics; (Ib) a large dataset can be compressed to polylogarithmic size while leaving the loss landscape of the corresponding model unchanged. Implication (Ia) directly proves the dynamical lottery ticket hypothesis, which states that any ordinary network can be strongly compressed such that its learning dynamics and final result remain unchanged. Implication (Ib) shows that a neural scaling law of the form $L\sim d^{-\alpha}$ can be boosted to an arbitrarily fast power-law decay, and ultimately to $\exp(-\alpha' \sqrt[m]{d})$.
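One heuristic way to read the final rate, offered as a sketch rather than as the argument of this work: assume a dataset of size $d$ is compressed to $d' = c\,(\log d)^m$ points, where the constant $c>0$ and the exponent $m\ge 1$ are illustrative assumptions standing in for the polylogarithmic rate, and assume the compressed dataset attains the same loss as the original. Inverting this relation and substituting into the power law gives

$$
\log d = \left(\frac{d'}{c}\right)^{1/m},
\qquad
L \sim d^{-\alpha} = \exp(-\alpha \log d) = \exp\!\left(-\alpha' \sqrt[m]{d'}\right),
\qquad
\alpha' = \alpha\, c^{-1/m}.
$$

Under these assumptions, the loss expressed in terms of the compressed size $d'$ decays as a stretched exponential, which matches the form $\exp(-\alpha' \sqrt[m]{d})$ once $d'$ is renamed $d$.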