Large-scale distributed model training runs simultaneously on up to thousands of machines. When an unexpected fault occurs, quickly detecting the faulty machine is critical. In our experience, a training task encounters two faults per day on average, each possibly halting training for hours. To avoid time-consuming and labor-intensive manual scrutiny, we propose Minder, an automatic faulty machine detector for distributed training tasks. The key idea of Minder is to automatically and efficiently detect the distinctive monitoring metric patterns of a faulty machine, which can persist for a period before the entire training task comes to a halt. Minder has been deployed in our production environment for over one year, monitoring daily distributed training tasks that each involve up to thousands of machines. In our real-world fault detection scenarios, Minder reacts to faults accurately and efficiently, within 3.6 seconds on average, with a precision of 0.904 and an F1-score of 0.893.
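To make the key idea concrete, the following is a minimal sketch and not Minder's actual algorithm. It assumes each machine reports one scalar metric per sampling interval (for example, GPU utilization) and flags a machine whose value deviates from its peers for several consecutive samples. The function names, the median-absolute-deviation score, the threshold of 3.5, and the three-sample persistence window are all illustrative assumptions that do not come from the source.

```python
from statistics import median


def most_deviant_machine(values, threshold=3.5):
    """Return the machine whose metric deviates most from its peers, or None.

    `values` maps machine id -> metric value for one sampling interval.
    The score is a robust z-score based on the median absolute deviation
    (an assumption of this sketch, not Minder's published method).
    """
    med = median(values.values())
    deviations = {m: abs(v - med) for m, v in values.items()}
    mad = median(deviations.values())
    scale = mad if mad > 0 else 1e-9  # avoid division by zero when peers agree
    machine, dev = max(deviations.items(), key=lambda kv: kv[1])
    # 0.6745 makes the MAD comparable to a standard deviation for normal data.
    score = 0.6745 * dev / scale
    return machine if score > threshold else None


def detect_faulty_machine(samples, min_consecutive=3, threshold=3.5):
    """Return the first machine that is an outlier for `min_consecutive`
    consecutive samples, or None if no machine stays abnormal that long."""
    streak_machine, streak = None, 0
    for values in samples:
        candidate = most_deviant_machine(values, threshold)
        if candidate is not None and candidate == streak_machine:
            streak += 1
        else:
            streak_machine = candidate
            streak = 1 if candidate is not None else 0
        if streak >= min_consecutive:
            return streak_machine
    return None


# Hypothetical GPU-utilization samples for four machines.
samples = [
    {"m0": 0.95, "m1": 0.94, "m2": 0.96, "m3": 0.95},
    {"m0": 0.95, "m1": 0.94, "m2": 0.96, "m3": 0.40},
    {"m0": 0.94, "m1": 0.95, "m2": 0.96, "m3": 0.35},
    {"m0": 0.95, "m1": 0.96, "m2": 0.94, "m3": 0.30},
]
faulty = detect_faulty_machine(samples)
```

Requiring the deviation to persist across consecutive samples reflects the observation that abnormal patterns last for a period before the task halts, and it keeps a single transient dip from raising an alert.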