Recommendation models rely on deep learning networks and large embedding tables, making them both computationally and memory intensive. These models are typically trained in hybrid CPU-GPU or GPU-only configurations. The hybrid mode combines the GPU's neural network acceleration with the CPU's large main memory for storing and supplying embedding tables, but may incur significant CPU-to-GPU transfer time. In contrast, the GPU-only mode stores embedding tables in High Bandwidth Memory (HBM) across multiple GPUs. However, this approach is expensive and presents scaling concerns. This paper introduces Hotline, a heterogeneous acceleration pipeline that addresses these concerns. Hotline develops a data-aware and model-aware scheduling pipeline by leveraging the insight that only a few embedding entries are frequently accessed (popular). It uses CPU main memory for non-popular embeddings and GPUs' HBM for popular embeddings. To achieve this, the Hotline accelerator fragments a mini-batch into popular and non-popular micro-batches. It gathers the necessary working parameters for non-popular micro-batches from the CPU, while GPUs execute popular micro-batches. The hardware accelerator dynamically coordinates the execution of popular embeddings on GPUs with non-popular embeddings fetched from the CPU's main memory. Experiments on real-world datasets and models confirm Hotline's effectiveness, reducing end-to-end training time by 2.2x on average compared to an Intel-optimized CPU-GPU DLRM baseline.
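The popularity-based fragmentation described above can be sketched as follows. This is a minimal illustration, not Hotline's actual implementation: the function name `split_minibatch`, the sample layout, and the ID sets are assumptions chosen for clarity.

```python
def split_minibatch(batch, popular_ids):
    """Split a mini-batch of embedding lookups into popular and
    non-popular micro-batches.

    batch: list of samples, each a list of embedding IDs it accesses.
    popular_ids: set of frequently accessed (GPU HBM-resident) IDs.
    """
    popular_mb, nonpopular_mb = [], []
    for sample in batch:
        if all(eid in popular_ids for eid in sample):
            # Every ID is already in GPU HBM: execute directly on GPU.
            popular_mb.append(sample)
        else:
            # Some IDs live in CPU main memory: their working
            # parameters must be gathered from the CPU first.
            nonpopular_mb.append(sample)
    return popular_mb, nonpopular_mb

# Hypothetical example: IDs 0-3 are popular.
popular = {0, 1, 2, 3}
batch = [[0, 1], [2, 7], [3], [5, 6]]
pop, nonpop = split_minibatch(batch, popular)
# pop    -> [[0, 1], [3]]
# nonpop -> [[2, 7], [5, 6]]
```

In the full pipeline, the non-popular micro-batches would overlap their CPU parameter gathers with GPU execution of the popular micro-batches, hiding the CPU-to-GPU transfer time.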