The quality of training data is crucial for enhancing the long-text capabilities of foundation models. Despite existing efforts to refine data quality through heuristic rules and evaluations based on data diversity and difficulty, there is a lack of systematic approaches tailored specifically to assessing long texts. To address this gap, we systematically measure the quality of long texts along three fundamental linguistic dimensions: coherence, cohesion, and complexity. Drawing on these three dimensions, we introduce a suite of metrics for evaluating long-text quality, comprising both statistical metrics and metrics based on pre-trained language models. Leveraging these metrics, we present LongWanjuan, a bilingual dataset of over 160B tokens designed to enhance the training of language models for long-text tasks. In LongWanjuan, we categorize long texts into holistic, aggregated, and chaotic types, enabling a detailed analysis of long-text quality. Furthermore, we devise a data mixture recipe that strategically balances the different types of long texts within LongWanjuan, leading to significant improvements in model performance on long-text tasks. The code and dataset are available at https://github.com/OpenLMLab/LongWanjuan.