Language models have shown promise on a variety of tasks but can be compromised by undesired data introduced during training, fine-tuning, or alignment. For example, if some unsafe conversations are wrongly annotated as safe, a model fine-tuned on these samples may produce harmful outputs. The correctness of annotations, i.e., the credibility of the dataset, is therefore critical. This study focuses on the credibility of real-world datasets that can be used to train a harmless language model, including the popular benchmarks Jigsaw Civil Comments, Anthropic Harmless & Red Team, and PKU BeaverTails & SafeRLHF. Given the cost and difficulty of cleaning these datasets manually, we introduce a systematic framework for evaluating the credibility of a dataset, identifying label errors, and assessing the influence of noisy labels on curated language data, with a focus on classifying unsafe comments and conversations. With this framework, we find and fix an average of 6.16% label errors across 11 datasets constructed from the above benchmarks. Directly fixing these label errors remarkably improves both data credibility and downstream learning performance, underscoring the importance of cleaning existing real-world datasets. We provide an open-source tool, Docta, for data cleaning at https://github.com/Docta-ai/docta.
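To make the core idea concrete, below is a minimal sketch of similarity-based label-error detection: a sample is flagged as a likely annotation error when the labels of its nearest neighbors in feature space mostly disagree with its own label. This is an assumption-laden illustration, not Docta's actual API: the function name `flag_label_errors`, the TF-IDF features, the neighborhood size `k`, and the agreement threshold are all hypothetical choices made for this sketch, and the full framework goes further (e.g., estimating how often labels are corrupted; see the repository for the real interface).

```python
# Illustrative sketch only: flag samples whose nearest neighbors (by TF-IDF
# cosine similarity) mostly carry a different label than the sample itself.
# All names here are hypothetical and do NOT reflect Docta's real API.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import NearestNeighbors

def flag_label_errors(texts, labels, k=3, threshold=0.5):
    """Return indices of samples whose k nearest neighbors mostly
    disagree with the sample's annotated label."""
    X = TfidfVectorizer().fit_transform(texts)
    labels = np.asarray(labels)
    # k + 1 neighbors because each point is its own nearest neighbor.
    nn = NearestNeighbors(n_neighbors=min(k + 1, len(texts)), metric="cosine")
    _, idx = nn.fit(X).kneighbors(X)
    # Fraction of neighbors (excluding self) that agree with the given label.
    agreement = (labels[idx[:, 1:]] == labels[:, None]).mean(axis=1)
    return np.where(agreement < threshold)[0]

if __name__ == "__main__":
    texts = [
        "you are wonderful", "you are great", "you are kind", "you are amazing",
        "I will hurt you", "I will hurt them", "they will hurt you",
        "you deserve to be hurt",
        "I will hurt everyone",  # unsafe text, but annotated as safe below
    ]
    labels = [0, 0, 0, 0, 1, 1, 1, 1, 0]  # 0 = safe, 1 = unsafe
    print(flag_label_errors(texts, labels))  # flags index 8 on this toy set
```

On this toy set, only the last sample is flagged: it is annotated safe, yet its nearest neighbors are all labeled unsafe, so its neighborhood agreement falls below the threshold. Real pipelines would replace the TF-IDF features with stronger sentence embeddings.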