Deep Learning Recommendation Models (DLRMs) have gained popularity in recommendation systems due to their effectiveness in handling large-scale recommendation tasks. The embedding layers of DLRMs have become a performance bottleneck due to their intensive demands for memory capacity and memory bandwidth. In this paper, we propose UpDLRM, which utilizes real-world processing-in-memory (PIM) hardware, the UPMEM DPU, to boost memory bandwidth and reduce recommendation latency. The parallel nature of DPU memory can provide high aggregate bandwidth for the large number of irregular memory accesses in embedding lookups, offering great potential to reduce inference latency. To fully utilize the DPU memory bandwidth, we further study the embedding table partitioning problem to achieve good workload balance and efficient data caching. Evaluations using real-world datasets show that UpDLRM achieves much lower inference time for DLRM than both CPU-only and CPU-GPU hybrid counterparts.
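To make the bottleneck concrete, the sketch below illustrates what an embedding lookup does and why its memory accesses are irregular: each sparse input names arbitrary rows of a large table, so the reads are data-dependent gathers rather than sequential scans. This is a minimal illustration in NumPy, not the paper's implementation; the table shape and indices are made up for the example.

```python
# Minimal sketch (not UpDLRM's implementation): an embedding lookup is a
# gather-and-reduce over rows of a large table, driven by sparse indices.
# The row indices are data-dependent, so the memory accesses are irregular
# -- the access pattern that benefits from high aggregate DPU bandwidth.
import numpy as np

num_rows, dim = 100_000, 64                        # illustrative table shape
table = np.random.rand(num_rows, dim).astype(np.float32)

def embedding_lookup(table, indices):
    """Gather the rows named by `indices` and pool them (sum pooling)."""
    return table[indices].sum(axis=0)              # irregular, index-driven reads

# One sparse feature of a single inference request:
indices = np.array([12, 98_765, 3_141, 42])       # arbitrary row ids
vec = embedding_lookup(table, indices)             # shape: (64,)
```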
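The abstract also refers to the embedding table partitioning problem. The paper's actual partitioning scheme is not given here; as one plausible baseline, the following hedged sketch greedily assigns each table to the least-loaded DPU group, using an assumed per-table load proxy (expected lookup count). The function name and inputs are hypothetical, chosen only to illustrate the workload-balance objective.

```python
# Hedged sketch of a generic greedy load-balancing heuristic, NOT the
# partitioning algorithm studied in the paper: assign each embedding table
# to the DPU with the smallest accumulated load, where "load" is an assumed
# proxy (expected lookups per table).
import heapq

def partition_tables(table_loads, num_dpus):
    """table_loads: {table_id: expected lookup count}.
    Returns {dpu_id: [table_ids]} balanced by total expected lookups."""
    heap = [(0, dpu) for dpu in range(num_dpus)]   # (accumulated load, dpu_id)
    heapq.heapify(heap)
    assignment = {dpu: [] for dpu in range(num_dpus)}
    # Place the heaviest tables first so the greedy choice balances well.
    for tid, load in sorted(table_loads.items(), key=lambda kv: -kv[1]):
        cur, dpu = heapq.heappop(heap)
        assignment[dpu].append(tid)
        heapq.heappush(heap, (cur + load, dpu))
    return assignment

print(partition_tables({0: 900, 1: 400, 2: 300, 3: 250}, num_dpus=2))
# -> {0: [0], 1: [1, 2, 3]}  (per-DPU totals: 900 vs. 950)
```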