Selecting high-quality pre-training data is important for creating capable language models, but existing methods rely on simple heuristics. We introduce QuRating, a method for selecting pre-training data that captures the abstract qualities of texts which humans intuitively perceive. In this paper, we investigate four qualities: writing style, required expertise, facts & trivia, and educational value. We find that LLMs are able to discern these qualities, and we observe that they are better at making pairwise judgments of texts than at rating the quality of a text directly. We train a QuRater model to learn scalar ratings from pairwise judgments, and use it to annotate a 260B-token training corpus with quality ratings for each of the four criteria. In our experiments, we select 30B tokens according to the different quality ratings and train 1.3B-parameter language models on the selected data. We find that it is important to balance quality and diversity, as selecting only the highest-rated documents leads to poor results. When we sample using quality ratings as logits over documents, our models achieve lower perplexity and stronger in-context learning performance than baselines. Beyond data selection, we use the quality ratings to construct a training curriculum that improves performance without changing the training dataset. We extensively analyze the quality ratings and discuss their characteristics, biases, and wider implications.
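To make two steps of the abstract concrete, the following is a minimal sketch, not the authors' implementation. It assumes (1) a Bradley-Terry-style objective for turning pairwise judgments into scalar ratings, and (2) sampling without replacement via the Gumbel top-k trick when ratings are used as logits over documents. The function names, the `temperature` parameter, and the token-budget stopping rule are all hypothetical choices made for illustration; the abstract specifies none of them.

```python
import numpy as np


def pairwise_loss(rating_a, rating_b, prob_b_preferred):
    """Cross-entropy loss for a Bradley-Terry-style model (an assumption here).

    The model predicts P(b preferred over a) = sigmoid(rating_b - rating_a);
    prob_b_preferred is the judged probability that text b wins.
    """
    diff = np.asarray(rating_b, dtype=np.float64) - np.asarray(rating_a, dtype=np.float64)
    p = np.asarray(prob_b_preferred, dtype=np.float64)
    # log sigmoid(x) = -logaddexp(0, -x), computed stably.
    return float(np.mean(p * np.logaddexp(0.0, -diff) + (1.0 - p) * np.logaddexp(0.0, diff)))


def sample_by_quality(ratings, token_counts, token_budget, temperature=1.0, seed=0):
    """Sample documents without replacement, treating ratings / temperature as logits.

    Returns indices of the selected documents, stopping once the selected
    documents together reach token_budget tokens (hypothetical stopping rule).
    """
    ratings = np.asarray(ratings, dtype=np.float64)
    token_counts = np.asarray(token_counts)
    rng = np.random.default_rng(seed)
    # Gumbel top-k: perturbing logits with Gumbel noise and sorting in descending
    # order is equivalent to sequential sampling without replacement from softmax(logits).
    keys = ratings / temperature + rng.gumbel(size=ratings.shape)
    order = np.argsort(-keys)
    # Keep the shortest prefix of the sampled order whose token total reaches the budget.
    cumulative = np.cumsum(token_counts[order])
    n_keep = int(np.searchsorted(cumulative, token_budget, side="left")) + 1
    return order[: min(n_keep, len(order))]


if __name__ == "__main__":
    ratings = [0.0, 1.0, 2.0, 3.0]
    token_counts = [10, 10, 10, 10]
    # A low temperature approaches top-k selection; a high one approaches uniform sampling.
    print(sample_by_quality(ratings, token_counts, token_budget=20, temperature=1.0))
```

In this sketch the temperature is the knob that trades quality against diversity: as it approaches zero, the procedure reduces to picking only the highest-rated documents, which the abstract reports leads to poor results, while a very large value ignores the ratings entirely.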