The quality of training data impacts the performance of pre-trained large language models (LMs). Given a fixed budget of tokens, we study how to best select data that leads to good downstream model performance across tasks. We develop a new framework based on a simple hypothesis: just as humans acquire interdependent skills in a deliberate order, language models also follow a natural order when learning a set of skills from their training data. If such an order exists, it can be used both to better understand LMs and to train them more data-efficiently. Using this intuition, our framework formalizes the notion of a skill and of an ordered set of skills in terms of the associated data. First, using both synthetic and real data, we demonstrate that these ordered skill sets exist, and that their existence enables more advanced skills to be learned with less data when we train on their prerequisite skills. Second, using our proposed framework, we introduce an online data sampling algorithm, Skill-It, over mixtures of skills for both continual pre-training and fine-tuning regimes, where the objective is to efficiently learn multiple skills in the former and an individual skill in the latter. On the LEGO synthetic dataset in the continual pre-training setting, Skill-It obtains 36.5 points higher accuracy than random sampling. On the Natural Instructions dataset in the fine-tuning setting, Skill-It reduces the validation loss on the target skill by 13.6% versus training on data associated with the target skill itself. We apply our skills framework to the recent RedPajama dataset to continually pre-train a 3B-parameter LM, achieving higher accuracy on the LM Evaluation Harness with 1B tokens than the baseline approach of sampling uniformly over data sources with 3B tokens.
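To make the idea of online sampling over a mixture of skills concrete, the following is a minimal sketch of one generic multiplicative-weights update. It is not claimed to be the update that Skill-It uses. The names `update_mixture`, `sample_skill`, `graph`, `eta`, and all numeric values are hypothetical: `graph[i][j] > 0` is assumed to mean that training on skill `i` helps skill `j`, and `val_losses` stands in for per-skill validation losses measured during training.

```python
import math
import random


def update_mixture(weights, val_losses, graph, eta=0.5):
    """One multiplicative-weights step over skill sampling proportions.

    Assumption: graph[i][j] > 0 means training on skill i helps skill j.
    """
    n = len(weights)
    scores = []
    for i in range(n):
        # A skill is up-weighted when the skills it is assumed to help
        # (including itself) still have high validation loss.
        signal = sum(graph[i][j] * val_losses[j] for j in range(n))
        scores.append(weights[i] * math.exp(eta * signal))
    total = sum(scores)
    return [s / total for s in scores]


def sample_skill(weights, rng):
    """Draw the index of the skill whose data supplies the next example."""
    return rng.choices(range(len(weights)), weights=weights, k=1)[0]


# Hypothetical 3-skill chain: skill 0 helps skill 1, skill 1 helps skill 2.
graph = [
    [1.0, 0.5, 0.0],
    [0.0, 1.0, 0.5],
    [0.0, 0.0, 1.0],
]
weights = [1 / 3, 1 / 3, 1 / 3]   # start from uniform sampling
val_losses = [0.2, 1.0, 2.0]      # illustrative values, not measured results

new_weights = update_mixture(weights, val_losses, graph)
rng = random.Random(0)
batch_skills = [sample_skill(new_weights, rng) for _ in range(8)]
```

In this toy setup the mixture moves away from the already-learned skill 0 and toward skills 1 and 2, which illustrates how a sampler can react to training progress instead of keeping a fixed proportion per data source.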