Keeping ML-based recommender models up-to-date as data drifts and evolves is essential to maintaining accuracy. As a result, online data preprocessing plays an increasingly important role in serving recommender systems. Existing solutions employ multiple CPU workers to saturate the input bandwidth of a single training node, an approach that results in high deployment costs and energy consumption. For instance, a recent report from industrial deployments shows that data storage and ingestion pipelines can account for over 60\% of the power consumption in a recommender system. In this paper, we tackle the issue from a hardware perspective by introducing Piper, a flexible and network-attached accelerator that executes data loading and preprocessing pipelines in a streaming fashion. As part of the design, we define MiniPipe, the smallest pipeline unit, which enables multiple pipelines to run concurrently on a single board by executing diverse data preprocessing tasks, and allows Piper to be reconfigured at runtime. Our results, using publicly released commercial pipelines, show that Piper, prototyped on a power-efficient FPGA, achieves a 39$\sim$105$\times$ speedup over a server-grade, 128-core CPU and a 3$\sim$17$\times$ speedup over GPUs such as the RTX 3090 and the A100 across multiple pipelines. The experimental analysis demonstrates that Piper provides advantages in both latency and energy efficiency for preprocessing tasks in recommender systems, offering an alternative design point for systems that are currently in very high demand.