The continuous growth in the size of personalized recommendation systems poses new challenges for model inference. Weight-sharing algorithms have been proposed to reduce embedding table capacity, but they increase the number of memory accesses. Recent advances in processing-in-memory (PIM) have improved recommendation-system throughput by exploiting memory parallelism. However, our analysis shows that running weight-sharing algorithms on prior PIM systems introduces CPU-PIM communication overhead, which degrades PIM throughput. We propose ProactivePIM, a specialized memory architecture that integrates PIM technology and is tailored to accelerate weight-sharing algorithms. ProactivePIM places an SRAM cache within the PIM and pairs it with an efficient prefetching scheme to exploit the unique locality of these algorithms and eliminate CPU-PIM communication.
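To make the capacity-versus-access trade-off concrete, the sketch below uses the quotient-remainder (QR) trick, one well-known weight-sharing scheme for embedding tables. It is chosen purely as an illustration: the abstract does not say which weight-sharing algorithms ProactivePIM targets, and the class `QREmbedding`, its parameters, and the `row_reads` counter are hypothetical names introduced here. Each index is split into a quotient and a remainder, and the two shared rows are combined element-wise. The table shrinks dramatically, but every lookup now reads two rows instead of one.

```python
import random


class QREmbedding:
    """Hypothetical QR-trick embedding used only to illustrate weight sharing.

    Not ProactivePIM's algorithm; the abstract does not name the scheme it targets.
    """

    def __init__(self, num_rows: int, dim: int, num_buckets: int, seed: int = 0):
        rng = random.Random(seed)
        self.m = num_buckets
        num_quotients = -(-num_rows // num_buckets)  # ceiling division
        # Two small shared tables replace one table with num_rows rows.
        self.q_table = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(num_quotients)]
        self.r_table = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(num_buckets)]
        self.row_reads = 0  # counts embedding rows fetched from memory

    def lookup(self, idx: int) -> list[float]:
        q, r = divmod(idx, self.m)
        self.row_reads += 2  # two shared rows per index instead of one
        # Element-wise product combines the shared rows into a unique vector.
        return [a * b for a, b in zip(self.q_table[q], self.r_table[r])]

    def num_params(self) -> int:
        return sum(len(row) for row in self.q_table) + sum(len(row) for row in self.r_table)


# Usage: compare against a plain 1,000,000 x 16 table.
num_rows, dim = 1_000_000, 16
emb = QREmbedding(num_rows, dim, num_buckets=1000)
print(num_rows * dim, emb.num_params())  # 16000000 32000
# Sum-pool the embeddings of one multi-hot feature, as recommendation models commonly do.
pooled = [sum(col) for col in zip(*(emb.lookup(i) for i in (5, 123_456, 999_999)))]
print(len(pooled), emb.row_reads)  # 16 6 (a plain table would need 3 row reads)
```

Because all indices with the same quotient reuse one `q_table` row (and likewise for remainders), the same small set of shared rows is read again and again. This is one plausible form of the reuse that an in-memory cache with prefetching could exploit, although the abstract does not specify exactly which locality ProactivePIM leverages.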