While scaling laws for recommendation models have gained significant traction, existing architectures such as Wukong, HiFormer, and DHEN often struggle with sub-optimal designs and hardware under-utilization, limiting their practical scalability. Our previous TokenMixer architecture (introduced in the RankMixer paper) addressed effectiveness and efficiency by replacing self-attention with a lightweight token-mixing operator; however, it faced critical bottlenecks in deeper configurations, including sub-optimal residual paths, vanishing gradients, incomplete MoE sparsification, and constrained scalability. In this paper, we propose TokenMixer-Large, a systematically evolved architecture designed for extreme-scale recommendation. By introducing a mixing-and-reverting operation, inter-layer residuals, and an auxiliary loss, we ensure stable gradient propagation even as model depth increases. Furthermore, we incorporate a Sparse Per-token MoE to enable efficient parameter expansion. TokenMixer-Large successfully scales to 7-billion parameters on online traffic and 15-billion parameters in offline experiments. Currently deployed in multiple scenarios at ByteDance, TokenMixer-Large has achieved significant offline and online performance gains: an increase of +1.66\% in orders and +2.98\% in per-capita preview payment GMV for e-commerce, a +2.0\% improvement in ADSS for advertising, and +1.4\% revenue growth for live streaming.