Training recommendation systems (RecSys) is challenging because the data preprocessing stage must transform a large volume of raw data and feed it to the GPU without stalling training. To sustain high training throughput, state-of-the-art solutions reserve a large fleet of CPU servers for preprocessing, which incurs substantial deployment cost and power consumption. Our characterization reveals that prior CPU-centric preprocessing is bottlenecked on feature generation and feature normalization operations because it fails to exploit the abundant inter- and intra-feature parallelism in RecSys preprocessing. PreSto is a storage-centric preprocessing system that leverages In-Storage Processing (ISP) to offload these bottleneck operations to our ISP units. We show that, on average for production-scale RecSys preprocessing, PreSto outperforms the baseline CPU-centric system with a $9.6\times$ speedup in end-to-end preprocessing time, a $4.3\times$ improvement in cost-efficiency, and an $11.3\times$ improvement in energy efficiency.