SilverTorch: A Unified Model-based System to Democratize Large-Scale Recommendation on GPUs

Bi Xue, Hong Wu, Lei Chen, Chao Yang, Yiming Ma, Fei Ding, Zhen Wang, Liang Wang, Xiaoheng Mao, Ke Huang, Xialu Li, Peng Xia, Rui Jian, Yanli Zhao, Yanzun Huang, Yijie Deng, Harry Tran, Ryan Chang, Min Yu, Eric Dong, Jiazhou Wang, Qianqian Zhang, Keke Zhai, Hongzhang Yin, Pawel Garbacki, Jiaqi Zhai, Zheng Fang, Yiyi Pan, Min Ni, Kevin Greer, Rui Zhang, Yang Liu

Serving deep-learning-based recommendation models (DLRM) at scale is challenging. Existing approaches rely on dedicated ANN indexing and filtering services on CPUs, which incur non-negligible costs and miss co-design opportunities. This inefficiency makes it difficult for them to support complex model architectures, such as learned similarities and multi-task retrieval. In this paper, we present SilverTorch, a model-based serving system that brings all components into one unified model by replacing standalone indexing and filtering services with model layers. We propose a model-based GPU Bloom index for feature filtering and a fused Int8 ANN kernel for nearest neighbor search. Through co-design of the ANN search and feature filtering, we reduce GPU memory usage and eliminate computation. Benefiting from this design, we scale up retrieval by introducing an OverArch scoring layer and multi-task retrieval with a Value Model to aggregate scores. These advancements improve retrieval accuracy and enable future studies on serving more complex models. Our evaluation on industry-scale datasets shows that SilverTorch achieves up to 23.7× higher throughput than state-of-the-art approaches. We also demonstrate that the SilverTorch solution is 13.35× more cost-efficient than a CPU-based solution while improving accuracy by serving more complex models. SilverTorch is deployed at scale, serving hundreds of models online and supporting recommendation for diverse applications.
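To make the idea of filtering and nearest neighbor search living in one code path more concrete, here is a minimal CPU-only sketch in plain Python. It is not SilverTorch's implementation: the abstract does not describe the kernels, so every name here (`bloom_signature`, `may_contain`, `quantize_int8`, `filtered_top_k`), the 64-bit signature size, the three hash functions, and the shared quantization scale are assumptions made for illustration. On a GPU the same steps would presumably be batched tensor operations rather than Python loops.

```python
import hashlib

NUM_BITS = 64    # bits per item signature (assumed size)
NUM_HASHES = 3   # bit positions set per feature (assumed)


def _bit_positions(feature):
    """Derive NUM_HASHES bit positions for a feature from one SHA-256 digest."""
    digest = hashlib.sha256(feature.encode("utf-8")).digest()
    return [
        int.from_bytes(digest[4 * i:4 * i + 4], "big") % NUM_BITS
        for i in range(NUM_HASHES)
    ]


def bloom_signature(features):
    """Fold a set of feature strings into a single integer bitmask."""
    signature = 0
    for feature in features:
        for pos in _bit_positions(feature):
            signature |= 1 << pos
    return signature


def may_contain(signature, feature):
    """Bloom membership test: never a false negative, occasionally a false positive."""
    mask = bloom_signature([feature])
    return signature & mask == mask


def quantize_int8(vec):
    """Quantize floats assumed to lie in [-1, 1] to integers in [-127, 127].

    A single shared scale (127) is used so integer dot products stay
    comparable across items.
    """
    return [max(-127, min(127, round(x * 127))) for x in vec]


def filtered_top_k(query, items, required_feature, k):
    """Return the ids of the k best-scoring items that pass the Bloom filter.

    `items` maps item id -> (bloom signature, int8 embedding).
    """
    q = quantize_int8(query)
    scored = []
    for item_id, (signature, emb) in items.items():
        if not may_contain(signature, required_feature):
            continue  # filtered out before any scoring work is done
        score = sum(a * b for a, b in zip(q, emb))  # integer dot product
        scored.append((score, item_id))
    scored.sort(reverse=True)
    return [item_id for _, item_id in scored[:k]]


# Hypothetical toy corpus: three items with features and 2-d embeddings.
items = {
    "a": (bloom_signature(["lang:en", "type:video"]), quantize_int8([1.0, 0.0])),
    "b": (bloom_signature(["lang:en"]), quantize_int8([0.0, 1.0])),
    "c": (bloom_signature(["lang:fr"]), quantize_int8([0.5, 0.0])),
}
best = filtered_top_k([1.0, 0.0], items, "lang:en", k=1)
```

Because the filter check sits inside the scoring loop, items that fail it never reach the dot product, which is one plausible reading of how co-designing the two steps saves computation. A Bloom-style check can admit a small fraction of non-matching items, so a real system would need to tolerate or re-verify false positives.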

