Instruction tuning has become an integral part of training pipelines for Large Language Models (LLMs) and has been shown to yield strong performance gains. In an orthogonal line of research, Annotation Error Detection (AED) has emerged as a tool for detecting quality problems in gold-standard labels. So far, however, the application of AED methods has been limited to classification tasks. It is an open question how well AED methods generalize to language generation settings, which are becoming more widespread via LLMs. In this paper, we present DONKII, the first benchmark for AED on instruction-tuning data. It comprises three instruction-tuning datasets enriched with error annotations by experts and semi-automatic methods. We also provide a novel taxonomy of error types for instruction-tuning data. We find that all three datasets contain clear errors, which sometimes propagate directly into instruction-tuned LLMs. We propose four AED baselines for the generative setting and evaluate them extensively on the newly introduced benchmark. Our results show that choosing the right AED method and model size is crucial, and we derive practical recommendations for how to use AED methods to clean instruction-tuning data.