In recent years, the study of scaling laws for large recommendation models has gradually gained attention. Works such as Wukong, HiFormer, and DHEN have attempted to increase the complexity of interaction structures in ranking models by stacking multiple layers, and to validate scaling laws relating performance to parameters and FLOPs. However, their experimental scale remains limited. Our previous work introduced the TokenMixer architecture, an efficient variant of the standard Transformer in which the self-attention mechanism is replaced by a simple reshape operation and the feed-forward network is adapted into a per-token FFN. The effectiveness of this architecture in the ranking stage was demonstrated by the model presented in the RankMixer paper. However, the foundational TokenMixer architecture has several design limitations. In this paper, we propose TokenMixer-Large, which systematically addresses four core issues: sub-optimal residual design, insufficient gradient updates in deep models, incomplete MoE sparsification, and limited exploration of scalability. By leveraging a mixing-and-reverting operation, inter-layer residual connections, an auxiliary loss, and a novel Sparse-Pertoken MoE architecture, TokenMixer-Large scales to 7 billion parameters on online traffic and 15 billion parameters in offline experiments. Now deployed in multiple scenarios at ByteDance, TokenMixer-Large has achieved significant offline and online performance gains.
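To make the abstract's description concrete, the following is a minimal numpy sketch of the two ingredients it names: a parameter-free reshape/transpose that mixes information across tokens in place of self-attention, and a per-token FFN with separate weights for each token. All function and variable names here are illustrative assumptions; the actual TokenMixer/RankMixer implementation may differ in detail (e.g. dimension splitting, activation, normalization).

```python
import numpy as np

def token_mix(x, num_tokens):
    """Reshape-based token mixing (hypothetical sketch, not the paper's code).
    Each token's d features are split into num_tokens chunks, and token i
    gathers the i-th chunk from every token -- a parameter-free stand-in
    for self-attention. Assumes d is divisible by num_tokens."""
    T, d = x.shape
    chunks = x.reshape(T, T, d // T)          # (token, chunk, sub_dim)
    return chunks.transpose(1, 0, 2).reshape(T, d)

def per_token_ffn(x, W1, W2):
    """A separate two-layer ReLU FFN per token: W1[i], W2[i] act only on token i."""
    h = np.maximum(np.einsum('td,tdh->th', x, W1), 0.0)
    return np.einsum('th,thd->td', h, W2)

rng = np.random.default_rng(0)
T, d, hidden = 4, 8, 16
x = rng.standard_normal((T, d))
W1 = rng.standard_normal((T, d, hidden)) * 0.1
W2 = rng.standard_normal((T, hidden, d)) * 0.1

mixed = token_mix(x, T)          # mix features across tokens
out = per_token_ffn(mixed, W1, W2)
reverted = token_mix(out, T)     # this reshape is its own inverse ("reverting")
```

Note that applying `token_mix` twice recovers the original token layout, which is one simple reading of the "mixing-and-reverting" operation mentioned above.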