Serving deep-learning-based recommendation models (DLRMs) at scale is challenging. Existing approaches rely on dedicated approximate nearest neighbor (ANN) indexing and filtering services that run on CPUs, which incurs non-negligible cost and misses co-design opportunities. This inefficiency makes it difficult for them to support complex model architectures, such as learned similarities and multi-task retrieval. In this paper, we present SilverTorch, a model-based serving system that replaces the standalone indexing and filtering services with model layers, bringing all components into one unified model. We propose a model-based GPU Bloom index for feature filtering and a fused Int8 ANN kernel for nearest neighbor search. By co-designing the ANN search and the feature filtering, we reduce GPU memory usage and eliminate redundant computation. Building on this design, we scale up retrieval by introducing an OverArch scoring layer and multi-task retrieval with a Value Model to aggregate scores. These advances improve retrieval accuracy and enable future work on serving more complex models. Our evaluation on industry-scale datasets shows that SilverTorch achieves up to 23.7× higher throughput than state-of-the-art approaches. We also demonstrate that the SilverTorch solution is 13.35× more cost-efficient than a CPU-based solution while improving accuracy by serving more complex models. SilverTorch is deployed at scale, serving hundreds of models online and supporting recommendation for diverse applications.
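To make the co-design idea concrete, the following is a minimal CPU-side sketch in plain Python of how Bloom-style feature filtering and Int8 scoring might be fused into a single pass over the candidates. It is an illustration under stated assumptions, not the SilverTorch GPU kernels: the signature width `NUM_BITS`, the hash count `NUM_HASHES`, the per-vector symmetric quantization scheme, and all function names are hypothetical choices made for this example.

```python
import hashlib
import heapq

NUM_BITS = 64    # Bloom signature width per item (assumed for illustration)
NUM_HASHES = 3   # hash functions per feature (assumed for illustration)


def bloom_bits(feature: str) -> int:
    """Return a bitmask with the Bloom bits of a single feature set."""
    mask = 0
    for seed in range(NUM_HASHES):
        digest = hashlib.blake2b(
            feature.encode("utf-8"), digest_size=8, salt=bytes([seed])
        ).digest()
        mask |= 1 << (int.from_bytes(digest, "big") % NUM_BITS)
    return mask


def bloom_signature(features) -> int:
    """OR together the Bloom bits of every feature an item carries."""
    sig = 0
    for feature in features:
        sig |= bloom_bits(feature)
    return sig


def quantize_int8(vec):
    """Symmetric per-vector quantization to integers in [-127, 127]."""
    scale = max(abs(x) for x in vec) / 127.0 or 1.0
    return [max(-127, min(127, round(x / scale))) for x in vec], scale


def build_index(raw_items):
    """raw_items maps item_id -> (features, embedding)."""
    index = {}
    for item_id, (features, embedding) in raw_items.items():
        q_vec, scale = quantize_int8(embedding)
        index[item_id] = (bloom_signature(features), q_vec, scale)
    return index


def filtered_top_k(query, required_features, index, k):
    """Single pass: Bloom check first, Int8 dot product only for survivors."""
    q_vec, _ = quantize_int8(query)  # query scale is constant, so it is dropped
    need = bloom_signature(required_features)
    heap = []
    for item_id, (sig, vec, scale) in index.items():
        if sig & need != need:
            continue  # definitely lacks a required feature: skip scoring
        score = scale * sum(a * b for a, b in zip(q_vec, vec))
        heapq.heappush(heap, (score, item_id))
        if len(heap) > k:
            heapq.heappop(heap)  # drop the current lowest score
    return [item_id for _, item_id in sorted(heap, reverse=True)]


index = build_index({
    "a": (["lang:en"], [1.0, 0.0]),
    "b": (["lang:en"], [0.0, 1.0]),
    "c": (["lang:fr"], [1.0, 0.1]),
})
print(filtered_top_k([1.0, 0.1], ["lang:en"], index, k=2))
```

Because the Bloom check runs before scoring, items that fail the filter never incur a dot product, which is the kind of saved computation the abstract attributes to co-design. A Bloom signature can produce false positives but never false negatives, so a production system would presumably need an exact check or a tolerable false-positive rate; that trade-off is not specified in the text above.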