Large models are widely acknowledged to have the potential to deliver superior performance across a broad range of domains. Although progress in machine learning systems research has enabled the development and exploration of large models, these capabilities remain confined to a small group of advanced users and industry leaders, creating an implicit technical barrier that keeps the wider community from accessing and leveraging these technologies. In this paper, we introduce PyTorch Fully Sharded Data Parallel (FSDP) as an industry-grade solution for large model training. FSDP has been closely co-designed with several key PyTorch core components, including the Tensor implementation, the dispatcher system, and the CUDA memory caching allocator, to provide a non-intrusive user experience and high training efficiency. Additionally, FSDP natively incorporates a range of techniques and settings to optimize resource utilization across a variety of hardware configurations. Experimental results demonstrate that FSDP achieves performance comparable to Distributed Data Parallel while supporting significantly larger models, with near-linear scalability in terms of TFLOPS.