LongWanjuan: Towards Systematic Measurement for Long Text Quality

The quality of training data is crucial for enhancing the long-text capabilities of foundation models. Despite existing efforts to refine data quality through heuristic rules and evaluations based on data diversity and difficulty, there is a lack of systematic approaches specifically tailored for assessing long texts. Addressing this gap, our work systematically measures the quality of long texts by evaluating three fundamental linguistic dimensions: coherence, cohesion, and complexity. Drawing on these three dimensions, we introduce a suite of metrics designed to evaluate the quality of long texts, encompassing both statistical metrics and metrics based on pre-trained language models. Leveraging these metrics, we present LongWanjuan, a bilingual dataset of over 160B tokens specifically tailored to enhance the training of language models for long-text tasks. In LongWanjuan, we categorize long texts into holistic, aggregated, and chaotic types, enabling a detailed analysis of long-text quality. Furthermore, we devise a data mixture recipe that strategically balances different types of long texts within LongWanjuan, leading to significant improvements in model performance on long-text tasks. The code and dataset are available at https://github.com/OpenLMLab/LongWanjuan.

