Large Language Models (LLMs) have become the de facto approach to sequence-to-sequence text generation, but for specialized tasks and domains, a pretrained LLM lacks the specific capabilities needed to produce accurate or well-formatted responses. Supervised fine-tuning specializes an LLM by training it on a dataset of example prompts with target responses, but real-world data tends to be noisy. While many fine-tuning algorithms exist, here we consider a \emph{data-centric AI} perspective on LLM fine-tuning, studying how to \emph{systematically} curate the training dataset to improve the LLM produced via \emph{any} fine-tuning algorithm. We introduce CLEAR (Confidence-based LLM Evaluation And Rectification), an automated data curation pipeline for instruction tuning datasets that can be used with any LLM and fine-tuning procedure. CLEAR estimates which training data is low-quality and either filters or corrects it. Which data to filter or correct is identified automatically via LLM-derived confidence estimates, ensuring that only confident modifications are made to the dataset. Unlike existing data curation techniques, CLEAR is a comprehensive framework that can improve a dataset (and trained model outputs) without additional fine-tuning computation. To assess whether CLEAR can meaningfully improve the capabilities of any LLM, we do not assume access to a stronger LLM than the model being fine-tuned (e.g.\ relying on GPT-4 when fine-tuning GPT-3.5). Experiments reveal that CLEAR consistently improves the performance of fine-tuned models across many datasets and models (such as GPT-3.5 and Llama2).
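The filter-or-correct decision described above can be sketched as follows. This is a minimal illustrative sketch, not the paper's implementation: the function names (`score_response`, `generate_candidate`) and the two thresholds are assumptions standing in for the paper's LLM-derived confidence estimates.

```python
# Hypothetical sketch of CLEAR-style confidence-based data curation.
# score_response and generate_candidate are assumed callables backed by an LLM;
# the thresholds below are illustrative, not values from the paper.

FILTER_THRESHOLD = 0.3  # assumed: below this, the target response is deemed low-quality
ACCEPT_THRESHOLD = 0.8  # assumed: a correction is kept only if it scores this confidently

def curate(dataset, score_response, generate_candidate):
    """Return a curated copy of dataset, filtering or correcting low-quality examples.

    dataset: list of (prompt, target_response) pairs.
    score_response(prompt, response) -> confidence in [0, 1] that the response is high-quality.
    generate_candidate(prompt) -> an LLM-proposed replacement response.
    """
    curated = []
    for prompt, target in dataset:
        if score_response(prompt, target) >= FILTER_THRESHOLD:
            curated.append((prompt, target))      # confident the example is fine: keep as-is
            continue
        candidate = generate_candidate(prompt)    # propose a corrected response
        if score_response(prompt, candidate) >= ACCEPT_THRESHOLD:
            curated.append((prompt, candidate))   # confident correction: rectify the example
        # otherwise the example is filtered out entirely
    return curated
```

The key property mirrored here is that the dataset is only ever modified when the model is confident, in either direction: low-confidence originals are removed rather than kept, and replacements are admitted only above a stricter acceptance threshold.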