Recently, 1-bit Large Language Models (LLMs) have emerged, showcasing an impressive combination of efficiency and performance that rivals traditional LLMs. Research by Wang et al. (2023) and Ma et al. (2024) indicates that the performance of these 1-bit LLMs progressively improves as the number of parameters increases, hinting at the existence of a scaling law for 1-bit neural networks. In this paper, we present the first theoretical result that rigorously establishes this scaling law for 1-bit models. We prove that, despite the constraint of weights restricted to $\{-1, +1\}$, the training dynamics of the model inevitably align with kernel behavior as the network width grows. This theoretical result guarantees that the 1-bit model converges to an arbitrarily small loss as the width increases. Furthermore, we introduce the notion of the generalization difference, defined as the gap between the outputs of 1-bit networks and their full-precision counterparts, and show that this difference remains negligible as the network width scales. Building on the work of Kaplan et al. (2020), we conclude by examining how the training loss scales as a power-law function of the model size, dataset size, and computational resources used for training. Our findings underscore the promising potential of scaling 1-bit neural networks, suggesting that int1 could become the standard precision in future neural networks.
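For concreteness, the power-law relations referred to above follow the standard form of Kaplan et al. (2020); the reference scales $N_c$, $D_c$, $C_c$ and exponents $\alpha_N$, $\alpha_D$, $\alpha_C$ below are the usual placeholder constants from that formulation, not values established in this paper:
\[
L(N) = \left(\frac{N_c}{N}\right)^{\alpha_N}, \qquad
L(D) = \left(\frac{D_c}{D}\right)^{\alpha_D}, \qquad
L(C) = \left(\frac{C_c}{C}\right)^{\alpha_C},
\]
where $N$ denotes the number of model parameters, $D$ the dataset size, and $C$ the compute used for training.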