Data is one of the most critical elements in building a large language model. However, existing systems either do not support customizing the corpus curation pipeline or neglect to use comprehensive corpus assessment to iteratively optimize the curation. To this end, we present Oasis, a pretraining corpus curation and assessment platform: a one-stop system for data quality improvement and quantification with user-friendly interactive interfaces. Specifically, the interactive modular rule filter module supports devising customized rules according to explicit feedback. The debiased neural filter module builds the quality classification dataset in a negative-centric manner to remove undesired bias. The adaptive document deduplication module can execute large-scale deduplication with limited memory resources. These three parts constitute the customized data curation module. In the holistic data assessment module, a corpus can be assessed from local and global views by three means of evaluation: human, GPT-4, and heuristic metrics. We demonstrate a complete process of using Oasis for the curation and assessment of pretraining data. In addition, an 800GB bilingual corpus curated by Oasis is publicly released.
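To make the idea of a modular rule filter concrete, the following is a minimal sketch, not the Oasis implementation. It assumes a rule can be modeled as a named predicate over a document string; the names `Rule`, `min_length`, `max_symbol_ratio`, and `apply_rules` are hypothetical and chosen only for illustration. Recording which rule rejected each document is one simple way to surface the kind of explicit feedback the abstract mentions, so that a rule can be inspected and adjusted.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass(frozen=True)
class Rule:
    """A hypothetical filter rule: a name plus a keep/drop predicate."""
    name: str
    keep: Callable[[str], bool]  # returns True when the document passes


def min_length(n: int) -> Rule:
    # Illustrative rule: drop documents shorter than n characters.
    return Rule(f"min_length_{n}", lambda doc: len(doc) >= n)


def max_symbol_ratio(limit: float) -> Rule:
    # Illustrative rule: drop documents dominated by non-alphanumeric symbols.
    def keep(doc: str) -> bool:
        if not doc:
            return False
        symbols = sum(1 for ch in doc if not (ch.isalnum() or ch.isspace()))
        return symbols / len(doc) <= limit

    return Rule(f"max_symbol_ratio_{limit}", keep)


def apply_rules(
    docs: List[str], rules: List[Rule]
) -> Tuple[List[str], List[Tuple[str, str]]]:
    """Split docs into kept documents and (document, first failing rule) pairs."""
    kept: List[str] = []
    rejected: List[Tuple[str, str]] = []
    for doc in docs:
        failed = next((r.name for r in rules if not r.keep(doc)), None)
        if failed is None:
            kept.append(doc)
        else:
            rejected.append((doc, failed))
    return kept, rejected


docs = [
    "short",
    "A reasonably long and clean sentence about data.",
    "#### $$$$ %%%% @@@@ !!!! ^^^^ &&&&",
]
kept, rejected = apply_rules(docs, [min_length(20), max_symbol_ratio(0.3)])
```

Because each rule is an independent object, rules can be added, removed, or re-parameterized without touching the others, and the rejected list shows which rule is responsible for each drop.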