The real-time performance of recommender models depends on the continuous integration of massive volumes of new user interaction data into training pipelines. While GPUs have scaled model training throughput, the data preprocessing stage, commonly expressed as Extract-Transform-Load (ETL) pipelines, has emerged as the dominant bottleneck. Production systems often dedicate clusters of CPU servers to support a single GPU node, incurring high operational costs. To address this issue, we present PipeRec, a hardware-accelerated ETL engine co-designed with online recommender model training. PipeRec introduces a training-aware ETL abstraction that exposes freshness, ordering, and batching semantics while compiling software-defined operators into reconfigurable FPGA dataflows; it overlaps ETL with GPU training to maximize utilization under I/O constraints. To eliminate CPU bottlenecks, PipeRec implements a format-aware packer that streams training-ready batches directly into GPU memory via P2P DMA transfers, enabling zero-copy ingest and efficient GPU consumption. Our evaluation on three datasets shows that PipeRec improves ETL throughput by more than 10x over CPU-based pipelines and by up to 17x over state-of-the-art GPU ETL systems. When integrated with training, PipeRec sustains 64-91% GPU utilization and reduces end-to-end training time to 9.94% of that of CPU-GPU pipelines.
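The overlap of ETL with GPU training can be sketched, in spirit, as a producer-consumer pipeline with a bounded prefetch queue: the ETL stage keeps packing batches ahead of the trainer so the accelerator never idles waiting for data. This is an illustrative sketch only; the names, queue depth, and structure are our assumptions, not PipeRec's actual API.

```python
import queue
import threading

# Illustrative only (not PipeRec's API): overlap ETL with training via a
# bounded prefetch queue. The bound applies backpressure to the ETL
# producer when the consumer (training) falls behind.

BATCHES = 8

def etl_producer(out_q):
    """Stand-in for the ETL stage: extract, transform, and pack batches."""
    for step in range(BATCHES):
        batch = [x * 2 for x in range(4)]  # placeholder for a packed batch
        out_q.put((step, batch))
    out_q.put(None)  # sentinel: stream finished

def train_consumer(in_q, results):
    """Stand-in for the training stage consuming training-ready batches."""
    while True:
        item = in_q.get()
        if item is None:
            break
        step, batch = item
        results.append((step, sum(batch)))  # placeholder for a training step

prefetch = queue.Queue(maxsize=2)  # depth 2: one batch in flight, one ready
results = []
producer = threading.Thread(target=etl_producer, args=(prefetch,))
producer.start()
train_consumer(prefetch, results)
producer.join()
print(len(results))  # all 8 batches consumed
```

In a real system, the producer thread would be replaced by FPGA dataflows writing into pinned GPU buffers over P2P DMA, and the bounded queue by a small ring of device-side buffers; the overlap principle is the same.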