Fine-grained sparsity promises higher parametric capacity without proportional per-token compute, but often suffers from training instability, load-balancing difficulties, and communication overhead. We introduce STEM (Scaling Transformers with Embedding Modules), a static, token-indexed approach that replaces the FFN up-projection with a layer-local embedding lookup while keeping the gate and down-projection dense. This removes runtime routing, enables CPU offload with asynchronous prefetch, and decouples capacity from both per-token FLOPs and cross-device communication. Empirically, STEM trains stably despite extreme sparsity. It improves downstream performance over dense baselines while reducing per-token FLOPs and parameter accesses (eliminating roughly one-third of FFN parameters). STEM learns embedding spaces with large angular spread, which enhances its knowledge storage capacity. Notably, this added capacity comes with improved interpretability: the token-indexed nature of STEM embeddings enables simple, interpretable knowledge editing and knowledge injection without modifying the input text or adding computation. In addition, STEM strengthens long-context performance: as sequence length grows, more distinct parameters are activated, yielding practical test-time capacity scaling. Across 350M and 1B model scales, STEM delivers up to ~3--4% accuracy improvements overall, with notable gains on knowledge- and reasoning-heavy benchmarks (ARC-Challenge, OpenBookQA, GSM8K, MMLU). Overall, STEM is an effective way to scale parametric memory while providing better interpretability, stronger training stability, and improved efficiency.
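The core mechanism described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: it assumes a SwiGLU-style gated FFN, and all names (`StemFFN`, `up_table`, `w_gate`, `w_down`) are hypothetical. The point is structural: the up-projection matmul is replaced by a static, token-indexed table lookup, while the gate and down-projection remain dense.

```python
import numpy as np

def silu(x):
    """SiLU activation, commonly used as the gate nonlinearity in gated FFNs."""
    return x / (1.0 + np.exp(-x))

class StemFFN:
    """Sketch of a STEM-style FFN block (illustrative names, not the paper's code)."""

    def __init__(self, vocab_size, d_model, d_ff, seed=0):
        rng = np.random.default_rng(seed)
        # Dense gate and down-projection, as in a standard gated FFN.
        self.w_gate = rng.standard_normal((d_model, d_ff)) * 0.02
        self.w_down = rng.standard_normal((d_ff, d_model)) * 0.02
        # Layer-local embedding table replacing the up-projection:
        # one d_ff-dimensional vector per vocabulary token, fetched by
        # static lookup on the token id -- no routing, no matmul.
        self.up_table = rng.standard_normal((vocab_size, d_ff)) * 0.02

    def __call__(self, x, token_ids):
        # x: (seq, d_model) hidden states; token_ids: (seq,) input token ids.
        up = self.up_table[token_ids]        # token-indexed lookup (sparse access)
        gate = silu(x @ self.w_gate)         # dense gate projection
        return (gate * up) @ self.w_down     # dense down-projection
```

Because the lookup index is the token id (known before the forward pass), the needed rows of `up_table` can be prefetched asynchronously from CPU memory, which is what decouples table size from per-token FLOPs and device memory.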