The advent of the Transformer architecture has propelled the growth of natural language processing (NLP) models, leading to remarkable achievements in numerous NLP tasks. However, the lack of specialized hardware, such as GPUs with large memory and high-speed interconnects, makes training large-scale models challenging and puts experimenting with pre-training and fine-tuning large language models (LLMs) out of reach for many users. In this work, we present \atom, a resilient distributed training framework for asynchronous training of massive models in a decentralized setting on cost-effective hardware, including consumer-grade GPUs and Ethernet. Unlike conventional model-partitioning approaches that distribute sub-models across GPUs, \atom fits a complete LLM on one host (peer) through seamless model swapping, and concurrently trains multiple copies across peers to maximize training throughput. Through static analysis, \atom identifies the best model-partitioning strategy and overlaps model execution with swapping. \atom offers two key benefits: it avoids the central point of failure found in pipeline-parallelism approaches, and it achieves better performance and scalability than tightly coupled pipeline parallelism over slow networks. Our experiments with different GPT-3 model configurations show that, under suboptimal network connections, \atom improves training efficiency by up to $20\times$ compared with state-of-the-art decentralized pipeline-parallelism approaches.