DeepSpeed Data Efficiency: Improving Deep Learning Model Quality and Training Efficiency via Efficient Data Sampling and Routing

From arXiv. Equal contribution by the first three authors. Code has been released as part of https://github.com/microsoft/DeepSpeed. Part of this paper is from our previous arXiv report (arXiv:2211.11586).

Recent advances in deep learning models come at the price of formidable training cost. Increasing model size is one of the root causes, but a less-emphasized fact is that data scale is growing at a similar pace to model scale, and training cost is proportional to both. Compared to rapidly evolving model architectures, how to use training data efficiently (especially for expensive foundation model pretraining) is both less explored and difficult to realize, because no convenient framework focuses on data efficiency capabilities. To this end, we present DeepSpeed Data Efficiency, a framework that makes better use of data, increases training efficiency, and improves model quality. Specifically, we propose and combine two novel data efficiency techniques: efficient data sampling via a general curriculum learning library, and efficient data routing via a novel random layerwise token dropping technique. DeepSpeed Data Efficiency also takes extensibility, flexibility, and composability into consideration, so that users can easily use the framework to compose multiple techniques and apply customized strategies. By applying our solution to GPT-3 1.3B and BERT-large language model pretraining, we can achieve similar model quality with up to 2x less data and 2x less time, or achieve better model quality with a similar amount of data and time.
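
As a rough illustration of the two ideas named in the abstract, the sketch below shows curriculum-style data sampling (draw batches only from examples under a difficulty cap that grows during training) and random token dropping at a single layer (only a random subset of tokens passes through the layer, and the rest bypass it unchanged). This is not the DeepSpeed API. The function names, the linear pacing schedule, and the use of sequence length as the difficulty metric are all assumptions made for illustration.

```python
import random


def linear_pacing(step, total_steps, start, end):
    """Hypothetical pacing function: the difficulty cap grows linearly
    from `start` to `end` over `total_steps`, then stays at `end`."""
    frac = min(max(step / total_steps, 0.0), 1.0)
    return start + frac * (end - start)


def curriculum_sample(examples, difficulty, cap, batch_size, rng):
    """Sample a batch only from examples whose difficulty is within `cap`.
    `difficulty` is any user-supplied metric (an assumption here)."""
    eligible = [ex for ex in examples if difficulty(ex) <= cap]
    return rng.sample(eligible, min(batch_size, len(eligible)))


def layer_with_token_dropping(tokens, layer_fn, keep, rng):
    """Apply `layer_fn` to `keep` randomly chosen token positions.
    Tokens that are not chosen skip the layer and are copied through."""
    n = len(tokens)
    kept = sorted(rng.sample(range(n), min(keep, n)))
    processed = layer_fn([tokens[i] for i in kept])
    out = list(tokens)
    for pos, val in zip(kept, processed):
        out[pos] = val
    return out, kept


if __name__ == "__main__":
    rng = random.Random(0)
    sentences = ["a b", "a b c d", "a b c d e f g h"]
    # Early in training the cap is low, so only short sequences are eligible.
    cap = linear_pacing(step=0, total_steps=100, start=2, end=8)
    print(curriculum_sample(sentences, lambda s: len(s.split()), cap, 2, rng))
    # Stand-in "layer" that multiplies each token value by 10.
    out, kept = layer_with_token_dropping([1, 2, 3, 4],
                                          lambda xs: [x * 10 for x in xs],
                                          keep=2, rng=rng)
    print(out, kept)
```

In this sketch each call to `layer_with_token_dropping` draws a fresh random subset, so different layers would see different tokens. Because the layer runs on fewer tokens and the sampler starts with shorter sequences, both pieces reduce per-step compute, which is how such techniques can compose.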

