We explore the impact of parameter sparsity on the scaling behavior of Transformers trained on massive datasets (i.e., "foundation models"), in both vision and language domains. In this setting, we identify the first scaling law describing the relationship between weight sparsity, number of non-zero parameters, and amount of training data, which we validate empirically across model and data scales on ViT/JFT-4B and T5/C4. These results allow us to characterize the "optimal sparsity": the sparsity level that yields the best performance for a given effective model size and training budget. For a fixed number of non-zero parameters, we find that the optimal sparsity increases with the amount of data used for training. We also extend our study to different sparsity structures (such as the hardware-friendly n:m pattern) and strategies (such as starting from a pretrained dense model). Our findings shed light on the power and limitations of weight sparsity across various parameter and computational settings, offering both theoretical understanding and practical implications for leveraging sparsity to improve computational efficiency.
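As an illustration of the n:m sparsity structure mentioned above, the following is a minimal sketch in which each group of m consecutive weights keeps exactly n non-zero entries. The magnitude-based selection rule, the function names `nm_sparsify` and `sparsity`, and the example values are assumptions made for illustration; the abstract does not specify how the retained weights are chosen.

```python
def nm_sparsify(weights, n=2, m=4):
    """Zero out all but n entries in each group of m consecutive weights.

    Assumption: the n entries kept are those with the largest magnitude.
    This is one common choice, not necessarily the procedure used in the paper.
    """
    if len(weights) % m != 0:
        raise ValueError("number of weights must be a multiple of m")
    pruned = []
    for start in range(0, len(weights), m):
        group = weights[start:start + m]
        # Positions of the n largest-magnitude entries within this group.
        order = sorted(range(m), key=lambda i: abs(group[i]), reverse=True)
        keep = set(order[:n])
        pruned.extend(w if i in keep else 0.0 for i, w in enumerate(group))
    return pruned


def sparsity(weights):
    """Fraction of entries that are exactly zero."""
    return sum(1 for w in weights if w == 0) / len(weights)


# Hypothetical weight row, pruned to the 2:4 pattern.
row = [0.9, -0.1, 0.05, -0.7, 0.2, 0.3, -0.4, 0.01]
sparse_row = nm_sparsify(row, n=2, m=4)
print(sparse_row)
print(sparsity(sparse_row))
```

With n = 2 and m = 4, every group of four weights retains two non-zero values, so the resulting sparsity is fixed at 1 - n/m = 0.5 regardless of the weight values.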