QuRating: Selecting High-Quality Data for Training Language Models

Selecting high-quality pre-training data is important for creating capable language models, but existing methods rely on simple heuristics. We introduce QuRating, a method for selecting pre-training data that captures the abstract qualities of texts which humans intuitively perceive. In this paper, we investigate four qualities - writing style, required expertise, facts & trivia, and educational value. We find that LLMs are able to discern these qualities and observe that they are better at making pairwise judgments of texts than at rating the quality of a text directly. We train a QuRater model to learn scalar ratings from pairwise judgments, and use it to annotate a 260B training corpus with quality ratings for each of the four criteria. In our experiments, we select 30B tokens according to the different quality ratings and train 1.3B-parameter language models on the selected data. We find that it is important to balance quality and diversity, as selecting only the highest-rated documents leads to poor results. When we sample using quality ratings as logits over documents, our models achieve lower perplexity and stronger in-context learning performance than baselines. Beyond data selection, we use the quality ratings to construct a training curriculum which improves performance without changing the training dataset. We extensively analyze the quality ratings and discuss their characteristics, biases, and wider implications.

翻译：选择高质量的预训练数据对于构建高性能语言模型至关重要，但现有方法依赖简单启发式规则。我们提出QuRating方法，通过捕捉文本的抽象质量（人类直觉感知的维度）来筛选预训练数据。本文探究四种质量维度：写作风格、专业知识需求、事实与趣闻、教育价值。研究发现大语言模型能够识别这些质量维度，且在文本成对比较判断中的表现优于直接质量评分。我们训练QuRater模型从成对判断中学习标量评分，并用其标注包含260B token的训练语料库，为每项质量维度生成评分。实验中，我们根据四种质量评分各选取30B token，在筛选数据上训练1.3B参数语言模型。实验表明，平衡质量与多样性至关重要——仅选取最高评分文档会导致性能下降。当以质量评分为文档采样logits时，我们的模型在困惑度和上下文学习性能上均超越基线模型。除数据筛选外，我们利用质量评分构建训练课程，在不改变训练数据集的情况下提升模型性能。我们对质量评分进行深入分析，探讨其特征、偏差及更广泛的影响。

相关内容

MoDELS

关注 45

ACM/IEEE第23届模型驱动工程语言和系统国际会议，是模型驱动软件和系统工程的首要会议系列，由ACM-SIGSOFT和IEEE-TCSE支持组织。自1998年以来，模型涵盖了建模的各个方面，从语言和方法到工具和应用程序。模特的参加者来自不同的背景，包括研究人员、学者、工程师和工业专业人士。MODELS 2019是一个论坛，参与者可以围绕建模和模型驱动的软件和系统交流前沿研究成果和创新实践经验。今年的版本将为建模社区提供进一步推进建模基础的机会，并在网络物理系统、嵌入式系统、社会技术系统、云计算、大数据、机器学习、安全、开源等新兴领域提出建模的创新应用以及可持续性。官网链接：http://www.modelsconference.org/

【CVPR 2022】一个完全无监督的框架，从噪声和部分测量中学习图像，Robust Equivariant Imaging: a fully unsupervised framework for learning to image

专知会员服务

25+阅读 · 2022年3月3日

UCM《机器学习导论笔记》，80页pdf CSE176 Introduction to Machine Learning

专知会员服务

32+阅读 · 2021年9月29日

FlowQA: Grasping Flow in History for Conversational Machine Comprehension

专知会员服务

35+阅读 · 2019年10月18日

Auto-Sizing the Transformer Network: Improving Speed, Efficiency, and Performance for Low-Resource Machine Translation

专知会员服务

50+阅读 · 2019年10月17日