Training deep learning models in the cloud or on dedicated hardware is expensive. A more cost-efficient option is the spot instances offered by hyperscale clouds, a cheap but ephemeral alternative to on-demand resources. Because spot instance availability varies with the time of day, continent, and cloud provider, distributing resources across the world could be more cost-efficient. However, it has not yet been investigated whether geo-distributed, data-parallel deep learning training on spot instances could be a more cost-efficient alternative to centralized training. This paper aims to answer the question: Can deep learning models be cost-efficiently trained on a global market of spot VMs spanning different data centers and cloud providers? To provide guidance, we extensively evaluate the cost and throughput implications of training in different zones, continents, and clouds for representative CV and NLP models. To expand the current training options further, we compare the scalability potential of hybrid-cloud scenarios, in which cloud resources are added to on-premise hardware to improve training throughput. Finally, we show how leveraging spot instance pricing enables a new cost-efficient way to train models with multiple cheap VMs, trumping both more centralized, more powerful hardware and even on-demand cloud offerings at competitive prices.