The development of deep learning architectures is a resource-demanding process, due to a vast design space, long prototyping times, and the high compute costs associated with at-scale model training and evaluation. We set out to simplify this process by grounding it in an end-to-end mechanistic architecture design (MAD) pipeline, encompassing small-scale capability unit tests predictive of scaling laws. Through a suite of synthetic token-manipulation tasks designed to probe specific capabilities, such as compression and recall, we identify and test new hybrid architectures constructed from a variety of computational primitives. We experimentally validate the resulting architectures via an extensive compute-optimal and a new state-optimal scaling law analysis, training over 500 language models between 70M and 7B parameters. Surprisingly, we find that MAD synthetics correlate with compute-optimal perplexity, enabling accurate evaluation of new architectures via isolated proxy tasks. The new architectures found via MAD, based on simple ideas such as hybridization and sparsity, outperform state-of-the-art Transformer, convolutional, and recurrent architectures (Transformer++, Hyena, Mamba) in scaling, both at compute-optimal budgets and in overtrained regimes. Overall, these results provide evidence that performance on curated synthetic tasks can be predictive of scaling laws, and that an optimal architecture should leverage specialized layers via a hybrid topology.
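To make the "recall" unit test concrete, below is a minimal sketch of an in-context associative recall generator of the kind such synthetic suites typically use: the model sees key-value pairs followed by a query key and must emit the associated value. The function name, vocabulary split, and parameters are illustrative assumptions, not the paper's exact task configuration.

```python
import random

def make_recall_example(vocab_size=64, num_pairs=8, seed=None):
    """Build one sequence of (key, value) pairs followed by a query key.

    The model must output the value previously bound to the query key,
    probing in-context recall rather than memorized token statistics.
    All names and defaults here are illustrative assumptions.
    """
    rng = random.Random(seed)
    # Keys drawn from the first half of the vocabulary, values from the second,
    # so the two roles never collide within a sequence.
    keys = rng.sample(range(vocab_size), num_pairs)
    values = [rng.randrange(vocab_size, 2 * vocab_size) for _ in keys]
    query_idx = rng.randrange(num_pairs)

    tokens = []
    for k, v in zip(keys, values):
        tokens.extend([k, v])            # interleaved key-value context
    tokens.append(keys[query_idx])       # query key at the end
    target = values[query_idx]           # expected next-token prediction
    return tokens, target

if __name__ == "__main__":
    seq, answer = make_recall_example(seed=0)
    print("input tokens:", seq)
    print("expected value:", answer)
```

Because each example is a few dozen tokens and solvable by small models, accuracy on batches of such sequences can serve as the kind of cheap, isolated proxy signal the MAD pipeline uses before committing to at-scale training.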