In this study, we explore the impact of relaxing data consistency in parallel machine learning training during failures, using various parameter server configurations. Our failure recovery strategies include traditional checkpointing (where the training job is resumed from the latest checkpoint), chain replication (where a backup server takes over in case of failure), and a novel stateless parameter server approach, in which workers continue generating gradient updates even while the parameter server is down and apply these updates once the server is back online. To assess the resilience and performance of each configuration, we intentionally killed the parameter server during training in each experiment. Our results indicate that the stateless parameter server approach continues to train towards convergence and improves accuracy by as much as 10\% in the face of a failure, despite using stale weights and gradients. The chain replication and checkpointing techniques also converge but suffer accuracy setbacks because they restart from old checkpoints. These results suggest that allowing workers to continue generating updates during server downtime and applying them later can effectively improve hardware utilization. Furthermore, despite higher resource usage, the stateless parameter server method incurs monetary costs similar to those of standard checkpointing due to the pricing structure of common cloud providers.
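The core mechanism of the stateless approach can be sketched as follows. This is an illustrative toy, not the system described above: the class, method names, and the scalar "gradient" computation are hypothetical stand-ins for a real worker's backward pass and server RPC, chosen only to show how updates are buffered during downtime and flushed on reconnect.

```python
import collections

class StatelessWorker:
    """Toy sketch of the stateless parameter-server idea: the worker keeps
    training against its local (possibly stale) weights while the server is
    unreachable, buffering each gradient, then applies the backlog in order
    once the server is back online."""

    def __init__(self, weights):
        self.weights = dict(weights)        # local, possibly stale, copy
        self.pending = collections.deque()  # gradients produced during downtime

    def compute_gradient(self, batch):
        # Placeholder for a real backward pass: pretend the batch sum is
        # the gradient for every parameter.
        return {k: sum(batch) for k in self.weights}

    def step(self, batch, server_up):
        grad = self.compute_gradient(batch)
        if server_up:
            self.flush()       # drain the backlog first, oldest update first
            self.apply(grad)
        else:
            self.pending.append(grad)  # server down: keep training, buffer

    def flush(self):
        while self.pending:
            self.apply(self.pending.popleft())

    def apply(self, grad, lr=0.1):
        # Stand-in for the server-side SGD update.
        for k, g in grad.items():
            self.weights[k] -= lr * g
```

A failure then looks like two buffered steps followed by a reconnect: `step(batch, server_up=False)` twice enqueues two gradients, and the next `step(..., server_up=True)` applies all three updates, so no worker compute is discarded even though the applied gradients are stale.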