In recent years, the study of scaling laws for large recommendation models has gradually gained attention. Works such as Wukong, HiFormer, and DHEN have attempted to increase the complexity of interaction structures in ranking models by stacking multiple layers, and to validate scaling laws relating performance to parameters and FLOPs. However, their experimental scale remains limited. Our previous work introduced the TokenMixer architecture, an efficient variant of the standard Transformer in which the self-attention mechanism is replaced by a simple reshape operation and the feed-forward network is adapted into a per-token FFN. The effectiveness of this architecture in the ranking stage was demonstrated by the model presented in the RankMixer paper. However, the foundational TokenMixer architecture has several design limitations. In this paper, we propose TokenMixer-Large, which systematically addresses four core issues: sub-optimal residual design, insufficient gradient updates in deep models, incomplete MoE sparsification, and limited exploration of scalability. By leveraging a mixing-and-reverting operation, inter-layer residual connections, an auxiliary loss, and a novel Sparse-Pertoken MoE architecture, TokenMixer-Large scales to 7 billion parameters on online traffic and 15 billion parameters in offline experiments. Now deployed in multiple scenarios at ByteDance, TokenMixer-Large has achieved significant offline and online performance gains.
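To make the abstract's description concrete, the following is a minimal numpy sketch of the two ingredients it names: a parameter-free reshape/transpose that mixes information across tokens in place of self-attention, and a per-token FFN with separate weights for each token. All function and variable names here are illustrative assumptions; the actual TokenMixer/RankMixer implementation may differ in detail (e.g. dimension splitting, activation, normalization).

```python
import numpy as np

def token_mix(x, num_tokens):
    """Reshape-based token mixing (hypothetical sketch, not the paper's code).
    Each token's d features are split into num_tokens chunks, and token i
    gathers the i-th chunk from every token -- a parameter-free stand-in
    for self-attention. Assumes d is divisible by num_tokens."""
    T, d = x.shape
    chunks = x.reshape(T, T, d // T)          # (token, chunk, sub_dim)
    return chunks.transpose(1, 0, 2).reshape(T, d)

def per_token_ffn(x, W1, W2):
    """A separate two-layer ReLU FFN per token: W1[i], W2[i] act only on token i."""
    h = np.maximum(np.einsum('td,tdh->th', x, W1), 0.0)
    return np.einsum('th,thd->td', h, W2)

rng = np.random.default_rng(0)
T, d, hidden = 4, 8, 16
x = rng.standard_normal((T, d))
W1 = rng.standard_normal((T, d, hidden)) * 0.1
W2 = rng.standard_normal((T, hidden, d)) * 0.1

mixed = token_mix(x, T)          # mix features across tokens
out = per_token_ffn(mixed, W1, W2)
reverted = token_mix(out, T)     # this reshape is its own inverse ("reverting")
```

Note that applying `token_mix` twice recovers the original token layout, which is one simple reading of the "mixing-and-reverting" operation mentioned above.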