Scaling laws play a central role in the success of Large Language Models (LLMs), enabling the prediction of model performance relative to compute budgets prior to training. While Transformers have been the dominant architecture, recent alternatives such as xLSTM offer linear complexity with respect to context length while remaining competitive in the billion-parameter regime. We conduct a comparative investigation into the scaling behavior of Transformers and xLSTM along the following lines, providing insights to guide future model design and deployment. First, we study the scaling behavior of xLSTM in compute-optimal and over-training regimes using both IsoFLOP and parametric-fit approaches across a wide range of model sizes (80M-7B) and training-token counts (2B-2T). Second, we examine the dependence of optimal model size on context length, a pivotal aspect that has been largely ignored in previous work. Finally, we analyze inference-time scaling characteristics. Our findings reveal that in typical LLM training and inference scenarios, xLSTM scales favorably compared to Transformers. Notably, xLSTM models consistently Pareto-dominate Transformer models, delivering lower cross-entropy loss for the same compute budget.
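To make the parametric-fit idea concrete, here is a minimal sketch of how a Chinchilla-style loss law yields a compute-optimal model size for a given FLOP budget. The constants `E`, `A`, `B`, `alpha`, `beta` below are illustrative placeholders, not fitted values from this work, and the `C ≈ 6ND` FLOP approximation is the standard Transformer training-cost heuristic.

```python
import math

# Hypothetical Chinchilla-style parametric loss law (illustrative constants,
# NOT fitted values from the paper):
#   L(N, D) = E + A / N**alpha + B / D**beta
# where N = model parameters, D = training tokens.
E, A, B = 1.69, 406.4, 410.7
alpha, beta = 0.34, 0.28

def loss(N: float, D: float) -> float:
    """Predicted cross-entropy loss for N parameters trained on D tokens."""
    return E + A / N**alpha + B / D**beta

def compute_optimal_N(C: float, num: int = 2000) -> tuple[float, float]:
    """Grid-search the model size N minimizing loss under the FLOP
    budget constraint C ~= 6 * N * D (so D = C / (6 * N))."""
    best_N, best_L = None, float("inf")
    for i in range(num):
        # log-spaced N from 1e7 to 1e11 parameters
        N = 10 ** (7 + 4 * i / (num - 1))
        D = C / (6 * N)  # tokens implied by the fixed FLOP budget
        L = loss(N, D)
        if L < best_L:
            best_N, best_L = N, L
    return best_N, best_L

# Larger compute budgets shift the optimum toward larger models.
N_small, _ = compute_optimal_N(1e20)
N_large, _ = compute_optimal_N(1e22)
```

An IsoFLOP analysis does the same thing empirically: train several model sizes at a fixed budget `C`, then read off the size with the lowest loss; the parametric fit instead estimates the law's constants jointly from all runs.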