Recent advances in deep learning models come at the price of formidable training cost. Increasing model size is one of the root causes, but a less-emphasized fact is that data scale is increasing at a similar speed as model scale, and training cost is proportional to both. Compared to rapidly evolving model architectures, how to use training data efficiently (especially for expensive foundation model pretraining) is both less explored and difficult to realize, owing to the lack of a convenient framework that focuses on data efficiency capabilities. To this end, we present DeepSpeed Data Efficiency, a framework that makes better use of data, increases training efficiency, and improves model quality. Specifically, we propose and combine two novel data efficiency techniques: efficient data sampling via a general curriculum learning library, and efficient data routing via a novel random layerwise token dropping technique. DeepSpeed Data Efficiency also takes extensibility, flexibility, and composability into consideration, so that users can easily use the framework to compose multiple techniques and apply customized strategies. By applying our solution to GPT-3 1.3B and BERT-large language model pretraining, we can achieve similar model quality with up to 2x less data and 2x less time, or achieve better model quality with a similar amount of data and time.
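To make the data routing idea concrete, the following is a minimal sketch of one plausible reading of random layerwise token dropping: each middle layer processes only a fresh random subset of the tokens, while the remaining tokens bypass that layer unchanged. It is an illustration under stated assumptions, not the DeepSpeed API. The function name `random_ltd_forward`, the toy layer `add_one`, and the choice to keep the first and last layers on the full sequence are all hypothetical.

```python
import random


def random_ltd_forward(tokens, layers, keep, rng):
    """Apply `layers` in order; middle layers see only `keep` random tokens.

    Hypothetical helper for illustration, not part of DeepSpeed.
    """
    hidden = list(tokens)
    last = len(layers) - 1
    for depth, layer in enumerate(layers):
        if depth in (0, last) or keep >= len(hidden):
            # Assumption: first and last layers process the full sequence.
            hidden = [layer(h) for h in hidden]
            continue
        # Draw a fresh random subset of token positions for this layer.
        kept = set(rng.sample(range(len(hidden)), keep))
        # Kept tokens go through the layer; dropped tokens bypass it unchanged.
        hidden = [layer(h) if i in kept else h for i, h in enumerate(hidden)]
    return hidden


def add_one(x):
    # Toy "layer": counts how many layers a token has passed through.
    return x + 1


rng = random.Random(0)
layers = [add_one] * 4
out = random_ltd_forward([0] * 8, layers, keep=4, rng=rng)
```

In this toy setting, each entry of `out` counts the layers that token passed through. With 8 tokens, 4 layers, and `keep=4`, the model performs 24 per-token layer applications instead of 32, which is where the compute saving comes from.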