This paper explores the use of LLMs for data preprocessing (DP), a crucial step in the data mining pipeline that transforms raw data into a clean format amenable to downstream processing. While the use of LLMs has sparked interest in devising universal solutions to DP, recent initiatives in this domain typically rely on GPT APIs, raising inevitable data breach concerns. Unlike these approaches, we consider instruction-tuning local LLMs (7--13B models) as universal DP task solvers that operate on a single, local, low-priced GPU, ensuring data security and enabling further customization. We select a collection of datasets across four representative DP tasks and construct instruction-tuning data using data configuration, knowledge injection, and reasoning data distillation techniques tailored to DP. By tuning Mistral-7B, Llama 3-8B, and OpenOrca-Platypus2-13B, our models, namely Jellyfish-7B/8B/13B, deliver competitive performance compared with GPT-3.5/4 models and strong generalizability to unseen tasks, while barely compromising the base models' abilities on NLP tasks. Meanwhile, Jellyfish offers enhanced reasoning capabilities compared with GPT-3.5. Our models are available at https://huggingface.co/NECOUDBFM/Jellyfish. Our instruction dataset is available at https://huggingface.co/datasets/NECOUDBFM/Jellyfish-Instruct.