The quality of training data is crucial for enhancing the long-text capabilities of foundation models. Despite existing efforts to refine data quality through heuristic rules and evaluations based on data diversity and difficulty, there is a lack of systematic approaches specifically tailored for assessing long texts. Addressing this gap, our work systematically measures the quality of long texts by evaluating three fundamental linguistic dimensions: coherence, cohesion, and complexity. Drawing on these three dimensions, we introduce a suite of metrics designed to evaluate the quality of long texts, encompassing both statistical metrics and metrics based on pre-trained language models. Leveraging these metrics, we present LongWanjuan, a bilingual dataset of over 160B tokens specifically tailored to enhance the training of language models for long-text tasks. In LongWanjuan, we categorize long texts into holistic, aggregated, and chaotic types, enabling a detailed analysis of long-text quality. Furthermore, we devise a data mixture recipe that strategically balances different types of long texts within LongWanjuan, leading to significant improvements in model performance on long-text tasks. The code and dataset are available at https://github.com/OpenLMLab/LongWanjuan.
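To make the idea of a data mixture recipe over the holistic, aggregated, and chaotic types concrete, the following is a minimal sketch of weighted sampling by document type. It is an illustration only, not the authors' implementation: the function name `sample_mixture`, the `MIXTURE_WEIGHTS` values, and the toy documents are all assumptions, and the abstract does not state the actual mixing ratios.

```python
import random

# Hypothetical placeholder weights, NOT taken from the paper:
# the abstract does not give the actual ratios of the recipe.
MIXTURE_WEIGHTS = {"holistic": 1.0, "aggregated": 2.0, "chaotic": 0.0}


def sample_mixture(docs_by_type, weights, n_samples, seed=0):
    """Draw n_samples (type, document) pairs with replacement.

    The type of each draw is chosen in proportion to its weight,
    then a document of that type is picked uniformly at random.
    """
    rng = random.Random(seed)
    # Keep only types with a positive weight and at least one document.
    types = [
        t for t in docs_by_type
        if weights.get(t, 0.0) > 0 and docs_by_type[t]
    ]
    if not types:
        raise ValueError("no type has both positive weight and documents")
    type_weights = [weights[t] for t in types]
    chosen_types = rng.choices(types, weights=type_weights, k=n_samples)
    return [(t, rng.choice(docs_by_type[t])) for t in chosen_types]


if __name__ == "__main__":
    # Toy corpus; in practice each type would be assigned by the quality metrics.
    toy_docs = {
        "holistic": ["h1", "h2"],
        "aggregated": ["a1"],
        "chaotic": ["c1"],
    }
    for doc_type, doc in sample_mixture(toy_docs, MIXTURE_WEIGHTS, 5):
        print(doc_type, doc)
```

In this sketch a weight of zero excludes a type entirely, and a larger weight upsamples it relative to the others. A real recipe would more likely balance by token count than by document count, which this example does not attempt.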