As industrial recommender systems enter a scaling-driven regime, Transformer architectures have become increasingly attractive for scaling models toward larger capacity and longer sequences. However, existing Transformer-based recommendation models remain structurally fragmented: sequence modeling and feature interaction are implemented as separate modules with independent parameterization. Such designs introduce a fundamental co-scaling challenge, as model capacity must be suboptimally allocated between dense feature interaction and sequence modeling under a limited computational budget. In this work, we propose MixFormer, a unified Transformer-style architecture tailored for recommender systems, which jointly models sequential behaviors and feature interactions within a single backbone. Through a unified parameterization, MixFormer enables effective co-scaling across both dense capacity and sequence length, mitigating the trade-off observed in decoupled designs. Moreover, the integrated architecture facilitates deep interaction between sequential and non-sequential representations, allowing high-order feature semantics to directly inform sequence aggregation and enhancing overall expressiveness. To ensure industrial practicality, we further introduce a user-item decoupling strategy that significantly reduces redundant computation and inference latency. Extensive experiments on large-scale industrial datasets demonstrate that MixFormer consistently achieves superior accuracy and efficiency. Furthermore, large-scale online A/B tests on two production recommender systems, Douyin and Douyin Lite, show consistent improvements in user engagement metrics, including active days and in-app usage duration.
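To make the notion of a "unified backbone" concrete, the sketch below illustrates one possible reading of the idea: user behavior tokens and non-sequential feature tokens are concatenated into a single token sequence and processed by one shared Transformer, so sequence modeling and feature interaction use the same parameters. This is a minimal, hypothetical sketch, not the paper's implementation; all module names, dimensions, and the mean-pooling head are illustrative assumptions.

```python
# Minimal sketch of a unified backbone (assumed structure, not the paper's code).
import torch
import torch.nn as nn

class UnifiedBackbone(nn.Module):
    def __init__(self, num_items, num_feature_ids, d_model=64, n_heads=4, n_layers=2):
        super().__init__()
        self.item_emb = nn.Embedding(num_items, d_model)        # sequential behavior tokens
        self.feat_emb = nn.Embedding(num_feature_ids, d_model)  # non-sequential feature tokens
        layer = nn.TransformerEncoderLayer(
            d_model, n_heads, dim_feedforward=4 * d_model, batch_first=True
        )
        # One shared encoder: both token types attend to each other,
        # so capacity is not split between two separate modules.
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(d_model, 1)

    def forward(self, behavior_ids, feature_ids):
        # behavior_ids: (B, L) item-id sequence; feature_ids: (B, F) hashed feature ids
        tokens = torch.cat(
            [self.item_emb(behavior_ids), self.feat_emb(feature_ids)], dim=1
        )
        h = self.encoder(tokens)                    # joint attention over all tokens
        return self.head(h.mean(dim=1)).squeeze(-1) # pooled logit, e.g. for CTR

# Usage: score a batch of 2 users, each with 8 behaviors and 5 non-sequential features.
model = UnifiedBackbone(num_items=1000, num_feature_ids=500)
logits = model(torch.randint(0, 1000, (2, 8)), torch.randint(0, 500, (2, 5)))
```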