Collaborative filtering (CF) has proven to be one of the most effective techniques for recommendation. Among CF approaches, SimpleX is the state-of-the-art method; it adopts a novel loss function and a proper number of negative samples. However, no existing work optimizes SimpleX on multi-core CPUs, which limits its performance. To this end, we perform in-depth profiling and analysis of existing SimpleX implementations and identify three performance bottlenecks: (1) irregular memory accesses, (2) unnecessary memory copies, and (3) redundant computations. To address these issues, we propose an efficient CF training system, called HEAT, that fully exploits the multi-level caching and multi-threading capabilities of modern CPUs. Specifically, the optimization of HEAT is threefold: (1) it tiles the embedding matrix to increase data locality and reduce cache misses, thereby reducing read latency; (2) it optimizes stochastic gradient descent (SGD) with sampling by parallelizing vector products instead of matrix-matrix multiplications, in particular for the similarity computation, to avoid the memory copies needed for matrix data preparation; and (3) it aggressively reuses intermediate results from the forward phase in the backward phase to reduce redundant computation. Evaluation on five widely used datasets with both x86- and ARM-architecture processors shows that HEAT achieves up to 45.2X speedup over the existing CPU solution, and 4.5X speedup and 7.9X cost reduction in the cloud over the existing GPU solution running on an NVIDIA V100 GPU.
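The third optimization, reusing forward-phase intermediates in the backward phase, can be illustrated with a minimal sketch. It is not HEAT's implementation: it assumes the similarity is a cosine similarity between one user vector and one item vector (both nonzero), and the function names `cosine_forward` and `cosine_backward` are hypothetical. The forward pass returns the two norms it already computed, so the backward pass forms both gradients without recomputing any reduction.

```python
import math

def cosine_forward(u, v):
    """Cosine similarity of two nonzero vectors (assumed similarity measure).

    Returns the similarity and a cache holding the two norms, so the
    backward pass does not have to recompute them.
    """
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    sim = dot / (norm_u * norm_v)
    return sim, (norm_u, norm_v)

def cosine_backward(u, v, sim, cache):
    """Gradients of the similarity w.r.t. u and v, using cached values.

    d sim / d u = v / (|u||v|) - sim * u / |u|^2
    d sim / d v = u / (|u||v|) - sim * v / |v|^2
    """
    norm_u, norm_v = cache  # reused from the forward pass
    inv = 1.0 / (norm_u * norm_v)
    grad_u = [b * inv - sim * a / (norm_u * norm_u) for a, b in zip(u, v)]
    grad_v = [a * inv - sim * b / (norm_v * norm_v) for a, b in zip(u, v)]
    return grad_u, grad_v
```

Each call works on a single pair of vectors, which also mirrors the vector-product formulation in the second optimization: no batch matrix has to be assembled, and therefore nothing is copied, before the similarity is computed.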