Neural scaling laws relate loss to model size in large language models (LLMs), yet depth and width may contribute to performance differently, warranting more detailed study. Here, we quantify how depth affects loss by analyzing LLMs and toy residual networks. We find that loss scales inversely with depth in LLMs, probably because functionally similar layers reduce error through ensemble averaging rather than through compositional learning or the discretization of smooth dynamics. This regime is inefficient yet robust, and may arise from the architectural bias of residual networks together with target functions incompatible with smooth dynamics. These findings suggest that improving LLM efficiency may require architectural innovations that encourage compositional use of depth.
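The ensemble-averaging intuition behind the inverse-with-depth scaling can be illustrated with a toy numerical sketch. This is an assumption-laden analogy, not the paper's actual experiment: each "layer" is modeled as an independent noisy estimator of a common target, so averaging `depth` estimates reduces mean squared error roughly as 1/depth, mirroring loss scaling inversely with depth.

```python
import numpy as np

# Toy model (hypothetical, for illustration only): each "layer" contributes an
# independent noisy estimate of a shared target; the network output is their mean.
rng = np.random.default_rng(0)
TARGET = 1.0
N_TRIALS = 20_000  # number of simulated networks per depth

def ensemble_mse(depth):
    # depth independent unit-variance estimates per trial, then averaged
    preds = TARGET + rng.normal(0.0, 1.0, size=(N_TRIALS, depth))
    return float(np.mean((preds.mean(axis=1) - TARGET) ** 2))

mse_1 = ensemble_mse(1)    # ~1.0
mse_4 = ensemble_mse(4)    # ~1/4
mse_16 = ensemble_mse(16)  # ~1/16
print(mse_1, mse_4, mse_16)
```

Under these assumptions the MSE falls as 1/depth, the same inverse scaling the abstract reports for loss; compositional use of depth would instead allow faster-than-1/depth improvement.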