Compute Only Once: UG-Separation for Efficient Large Recommendation Models

Hui Lu, Zheng Chai, Shipeng Bai, Hao Zhang, Zhifang Fan, Kunmin Bai, Yingwen Wu, Bingzheng Wei, Xiang Sun, Ziyan Gong, Tianyi Liu, Hua Chen, Deping Xie, Zhongkai Chen, Zhiliang Guo, Qiwei Chen, Yuchao Zheng

from arxiv, Large Recommender Model, Industrial Recommenders, Scaling Law

Driven by scaling laws, recommender systems increasingly rely on large-scale models to capture complex feature interactions and user behaviors, but this trend also leads to prohibitive training and inference costs. While long-sequence models (e.g., LONGER) can reuse user-side computation through KV caching, such reuse is difficult in dense feature interaction architectures (e.g., RankMixer), where user and group (candidate item) features are deeply entangled across layers. In this work, we propose User-Group Separation (UG-Sep), a novel framework that enables reusable user-side computation in dense interaction models for the first time. UG-Sep introduces a masking mechanism that explicitly disentangles user-side and item-side information flows within token-mixing layers, so that a subset of tokens preserves purely user-side representations across layers. This design enables the corresponding token computations to be reused across multiple samples, significantly reducing redundant inference cost. To compensate for the potential loss of expressiveness induced by masking, we further propose an Information Compensation strategy that adaptively reconstructs the suppressed user-item interactions. Moreover, because UG-Sep substantially reduces user-side FLOPs and thereby exposes memory-bound components, we incorporate W8A16 (8-bit weight, 16-bit activation) weight-only quantization to alleviate memory bandwidth bottlenecks and achieve additional acceleration. We conduct extensive offline evaluations and large-scale online A/B experiments at ByteDance, demonstrating that UG-Sep reduces inference latency by up to 20% without degrading online user experience or commercial metrics across multiple business scenarios, including feed recommendation and advertising systems.
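The reuse argument in the abstract can be illustrated with a toy example. The sketch below is not the authors' implementation: it assumes a hypothetical linear token-mixing layer (a learned weight per pair of tokens) and a block mask in which user tokens read only from user tokens while item tokens read from all tokens. All names (`mix`, `ug_mask`, `user_forward`, `item_forward`) are made up for illustration. Under these assumptions, the user-side states at every layer are independent of the candidate, so they can be computed once per request and shared across candidates.

```python
import random

def mix(weights, mask, tokens):
    """Masked token mixing: out[i] = sum_j mask[i][j] * weights[i][j] * tokens[j]."""
    dim = len(tokens[0])
    out = []
    for w_row, m_row in zip(weights, mask):
        acc = [0.0] * dim
        for w, m, tok in zip(w_row, m_row, tokens):
            if m:  # masked-out sources contribute nothing
                for k in range(dim):
                    acc[k] += w * tok[k]
        out.append(acc)
    return out

def ug_mask(n_user, n_item):
    """Assumed mask: user rows see only user columns; item rows see every token."""
    n = n_user + n_item
    return [[1 if (i >= n_user or j < n_user) else 0 for j in range(n)]
            for i in range(n)]

def full_forward(layers, mask, user_tokens, item_tokens):
    """Baseline: run every token through every layer for one candidate."""
    x = user_tokens + item_tokens
    for w in layers:
        x = mix(w, mask, x)
    return x

def user_forward(layers, n_user, user_tokens):
    """Candidate-independent part: the user states entering each layer, plus the final ones."""
    ones = [[1] * n_user for _ in range(n_user)]
    states = [user_tokens]
    for w in layers:
        w_uu = [row[:n_user] for row in w[:n_user]]  # user-to-user weights only
        states.append(mix(w_uu, ones, states[-1]))
    return states

def item_forward(layers, n_user, user_states, item_tokens):
    """Per-candidate part: compute only the item rows, reading cached user states."""
    items = item_tokens
    for w, users in zip(layers, user_states):  # zip stops after the last layer
        x = users + items
        rows = w[n_user:]
        ones = [[1] * len(x) for _ in rows]
        items = mix(rows, ones, x)
    return items

def max_abs_diff(a, b):
    return max(abs(x - y) for ra, rb in zip(a, b) for x, y in zip(ra, rb))

# Toy sizes, chosen arbitrarily for the demonstration.
random.seed(0)
n_user, n_item, dim, n_layers = 3, 2, 4, 2
n = n_user + n_item
layers = [[[random.uniform(-1, 1) for _ in range(n)] for _ in range(n)]
          for _ in range(n_layers)]
user = [[random.uniform(-1, 1) for _ in range(dim)] for _ in range(n_user)]
candidates = [[[random.uniform(-1, 1) for _ in range(dim)] for _ in range(n_item)]
              for _ in range(5)]

mask = ug_mask(n_user, n_item)
cached = user_forward(layers, n_user, user)  # computed once for the whole request
worst = 0.0
for cand in candidates:
    reference = full_forward(layers, mask, user, cand)
    fast = item_forward(layers, n_user, cached, cand)
    worst = max(worst,
                max_abs_diff(reference[n_user:], fast),
                max_abs_diff(reference[:n_user], cached[-1]))
print("max deviation from the unsplit forward pass:", worst)
```

In this toy setting the split path reproduces the masked full forward pass while the user rows are evaluated once instead of once per candidate. The sketch does not model the Information Compensation strategy or W8A16 quantization, whose details the abstract does not specify.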
