LongWanjuan: Towards Systematic Measurement for Long Text Quality

The quality of training data is crucial for enhancing the long-text capabilities of foundation models. Despite existing efforts to refine data quality through heuristic rules and through evaluations based on data diversity and difficulty, systematic approaches tailored to assessing long texts are still lacking. To address this gap, our work systematically measures the quality of long texts along three fundamental linguistic dimensions: coherence, cohesion, and complexity. Drawing on these three dimensions, we introduce a suite of metrics for evaluating long-text quality, including both statistical metrics and metrics based on pre-trained language models. Leveraging these metrics, we present LongWanjuan, a bilingual dataset of over 160B tokens tailored to training language models for long-text tasks. In LongWanjuan, we categorize long texts into holistic, aggregated, and chaotic types, enabling a detailed analysis of long-text quality. Furthermore, we devise a data mixture recipe that strategically balances the different types of long texts within LongWanjuan, leading to significant improvements in model performance on long-text tasks. The code and dataset are available at https://github.com/OpenLMLab/LongWanjuan.
