Deep recommendation systems (DRS) depend heavily on specialized HPC hardware and accelerators to optimize energy, efficiency, and recommendation quality. Despite the growing number of hardware errors observed in the large-scale fleet systems where DRS are deployed, the robustness of DRS has been largely overlooked. This paper presents the first systematic study of DRS robustness against hardware errors. We develop Terrorch, a user-friendly, efficient, and flexible error injection framework built on top of the widely used PyTorch. We evaluate a wide range of models and datasets and observe that DRS robustness against hardware errors is influenced by various factors, from model parameters to input characteristics. We also explore three error mitigation methods: algorithm-based fault tolerance (ABFT), activation clipping, and selective bit protection (SBP). We find that applying activation clipping can recover up to 30% of the degraded AUC-ROC score, making it a promising mitigation method.
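To make the error model and the clipping mitigation concrete, the following is a minimal sketch in plain Python. It is not Terrorch's API. The names `flip_bit` and `clip_activation`, the choice of a single bit flip in a float32 value as the hardware error, the bound of 10.0, and the mapping of NaN to zero are all assumptions made for illustration.

```python
import struct


def flip_bit(value: float, bit: int) -> float:
    """Flip one bit of the float32 encoding of `value`.

    Bit 0 is the least significant mantissa bit and bit 31 is the sign bit.
    Hypothetical helper: a single bit flip is one common way to model a
    hardware error, assumed here for illustration.
    """
    if not 0 <= bit < 32:
        raise ValueError("bit must be in [0, 31]")
    # Reinterpret the float32 bytes as an unsigned 32-bit integer.
    (as_int,) = struct.unpack("<I", struct.pack("<f", value))
    # XOR with a one-hot mask flips exactly the requested bit.
    (flipped,) = struct.unpack("<f", struct.pack("<I", as_int ^ (1 << bit)))
    return flipped


def clip_activation(x: float, bound: float) -> float:
    """Clamp an activation to [-bound, bound].

    Assumption: NaN is replaced by 0.0, since a NaN cannot be ordered
    against the bound. The bound itself is a hypothetical parameter.
    """
    if x != x:  # NaN is the only value not equal to itself
        return 0.0
    return max(-bound, min(bound, x))


# A flip in the top exponent bit turns a small activation into infinity.
corrupted = flip_bit(1.0, 30)
# Clipping pulls the corrupted value back into the assumed valid range.
repaired = clip_activation(corrupted, bound=10.0)
```

The sketch shows why clipping can help: flips in high exponent bits produce values far outside the normal activation range, and a bound limits how far such a value can propagate. A flip in a low mantissa bit stays inside the bound and passes through unchanged.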