Distributed Deep Learning (DDL) relies on GPU-based clusters as the standard infrastructure for training large-scale Deep Neural Networks (DNNs). However, the high cost of such resources makes them inaccessible to many users. Public cloud services, particularly Spot Virtual Machines (VMs), offer a cost-effective alternative, but their unpredictable availability poses a significant challenge to the crucial checkpointing process in DDL. To address this, we introduce DeepVM, a novel solution that recommends cost-effective cluster configurations by intelligently balancing the use of Spot and On-Demand VMs. DeepVM employs a four-stage process that analyzes instance performance using the FLOPP (FLoating-point Operations Per Price) metric, performs architecture-level analysis with linear programming, and identifies the optimal configuration for user-specific needs. Extensive simulations and real-world deployments in the AWS environment demonstrate that DeepVM consistently outperforms other policies, reducing both training costs and overall makespan. By enabling cost-effective checkpointing with Spot VMs, DeepVM opens up DDL to a wider range of users and facilitates more efficient training of complex DNNs.
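The FLOPP metric named above can be illustrated with a minimal sketch: rank candidate instance types by floating-point throughput delivered per unit price. The instance names, throughput figures, and prices below are hypothetical placeholders for illustration only, not real AWS specifications or quotes.

```python
# Hedged sketch of a FLOPP-style ranking (floating-point operations
# per price). All instance specs and prices are assumed, illustrative
# values -- not actual cloud offerings.

instances = {
    # name: (peak TFLOPS, hourly price in USD) -- hypothetical values
    "gpu.small":     (8.1, 0.50),
    "gpu.large":     (31.2, 3.06),
    "gpu.large.spot": (31.2, 0.92),  # same hardware at a Spot discount
}

def flopp(tflops, price_per_hour):
    """Throughput per dollar per hour (higher is better)."""
    return tflops / price_per_hour

# Sort candidates so the most cost-efficient instance comes first.
ranked = sorted(instances.items(),
                key=lambda kv: flopp(*kv[1]),
                reverse=True)

for name, (tflops, price) in ranked:
    print(f"{name}: {flopp(tflops, price):.2f} TFLOPS per $/h")
```

Under this toy pricing, the Spot variant dominates because it offers identical throughput at a fraction of the On-Demand price, which is precisely the trade-off DeepVM weighs against Spot unreliability.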