In this study, we explore the impact of relaxing data consistency in parallel machine learning training during failures, using various parameter server configurations. Our failure recovery strategies include traditional checkpointing (where the training job is resumed from the latest checkpoint), chain replication (where a backup server takes over in case of failure), and a novel stateless parameter server approach, in which workers continue generating gradient updates even while the parameter server is down and apply these updates once the server is back online. To assess the resilience and performance of each configuration, we intentionally killed the parameter server during training in each experiment. Our results indicate that the stateless parameter server approach continues to train towards convergence and improves accuracy by as much as 10\% in the face of a failure, despite using stale weights and gradients. The chain replication and checkpointing techniques also converge but suffer accuracy setbacks because they restart from old checkpoints. These results suggest that allowing workers to continue generating updates during server downtime and applying them later can effectively improve hardware utilization. Furthermore, despite higher resource usage, the stateless parameter server method incurs monetary costs similar to those of standard checkpointing due to the pricing structure of common cloud providers.
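The core mechanism of the stateless approach can be sketched as follows. This is an illustrative toy, not the system described above: the class, method names, and the scalar "gradient" computation are hypothetical stand-ins for a real worker's backward pass and server RPC, chosen only to show how updates are buffered during downtime and flushed on reconnect.

```python
import collections

class StatelessWorker:
    """Toy sketch of the stateless parameter-server idea: the worker keeps
    training against its local (possibly stale) weights while the server is
    unreachable, buffering each gradient, then applies the backlog in order
    once the server is back online."""

    def __init__(self, weights):
        self.weights = dict(weights)        # local, possibly stale, copy
        self.pending = collections.deque()  # gradients produced during downtime

    def compute_gradient(self, batch):
        # Placeholder for a real backward pass: pretend the batch sum is
        # the gradient for every parameter.
        return {k: sum(batch) for k in self.weights}

    def step(self, batch, server_up):
        grad = self.compute_gradient(batch)
        if server_up:
            self.flush()       # drain the backlog first, oldest update first
            self.apply(grad)
        else:
            self.pending.append(grad)  # server down: keep training, buffer

    def flush(self):
        while self.pending:
            self.apply(self.pending.popleft())

    def apply(self, grad, lr=0.1):
        # Stand-in for the server-side SGD update.
        for k, g in grad.items():
            self.weights[k] -= lr * g
```

A failure then looks like two buffered steps followed by a reconnect: `step(batch, server_up=False)` twice enqueues two gradients, and the next `step(..., server_up=True)` applies all three updates, so no worker compute is discarded even though the applied gradients are stale.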