Understanding architectural differences in language models is challenging, especially at academic-scale pretraining (e.g., 1.3B parameters, 100B tokens), where results are often dominated by noise and randomness. To overcome this, we introduce controlled synthetic pretraining tasks that isolate and evaluate core model capabilities. Within this framework, we discover CANON LAYERS: lightweight architectural components -- named after the musical term "canon" -- that promote horizontal information flow across neighboring tokens. Canon layers compute weighted sums of nearby token representations and integrate seamlessly into Transformers, linear attention, state-space models, or any sequence architecture. We present 12 key results, including how Canon layers enhance reasoning depth (e.g., by $2\times$), reasoning breadth, and knowledge manipulation. They lift weak architectures such as NoPE to match RoPE, and linear attention to rival SOTA linear models like Mamba2/GDN -- validated both on synthetic tasks and in real-world academic-scale pretraining. This synthetic playground offers an economical, principled path to isolating core model capabilities that are often obscured at academic scales. Equipped with infinite high-quality data, it may even PREDICT how future architectures will behave as training pipelines improve -- e.g., through better data curation or RL-based post-training -- unlocking deeper reasoning and hierarchical inference.
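To make the mechanism concrete, here is a minimal PyTorch sketch of what a Canon-style layer could look like, assuming a causal window over a few preceding tokens, learnable per-channel mixing weights, and a residual connection. The class name `CanonLayer`, the `window` parameter, and the zero initialization are illustrative assumptions, not details taken from the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CanonLayer(nn.Module):
    """Causal weighted sum over a small window of neighboring tokens (sketch only).

    Each output position mixes the current token with the previous `window - 1`
    tokens using learnable per-channel weights, then adds a residual connection.
    The exact parameterization in the paper may differ.
    """

    def __init__(self, dim: int, window: int = 4):
        super().__init__()
        # One weight per (channel, offset). Zero init makes the layer an
        # identity map at the start of training (residual path only).
        self.weights = nn.Parameter(torch.zeros(dim, window))
        self.window = window

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, dim)
        b, t, d = x.shape
        # Left-pad the sequence dimension so the weighted sum stays causal.
        x_padded = F.pad(x.transpose(1, 2), (self.window - 1, 0))  # (b, d, t + window - 1)
        # A depthwise conv implements the per-channel weighted sum over the window;
        # flipping the kernel makes weights[:, j] the weight for the token j steps back.
        kernel = self.weights.flip(-1).unsqueeze(1)  # (d, 1, window)
        mixed = F.conv1d(x_padded, kernel, groups=d)  # (b, d, t)
        return x + mixed.transpose(1, 2)  # residual: original representation plus the mix

# Example usage (shapes only): each token mixes with up to 3 preceding tokens.
layer = CanonLayer(dim=512, window=4)
x = torch.randn(2, 128, 512)   # (batch, seq_len, dim)
y = layer(x)                   # (2, 128, 512); y[:, t] depends only on x[:, :t+1]
```

In this sketch, the residual path plus near-identity initialization is one way such a component could slot into Transformer, linear-attention, or state-space blocks without perturbing their behavior at the start of training.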