This paper explores the use of LLMs for data preprocessing (DP), a crucial step in the data mining pipeline that transforms raw data into a clean format that is easy to process. Although LLMs have sparked interest in devising universal solutions to DP, recent initiatives in this domain typically rely on GPT APIs, which raises unavoidable data breach concerns. Unlike these approaches, we consider instruction-tuning local LLMs (7B to 13B models) as universal DP task solvers. We select a collection of datasets across four representative DP tasks and construct instruction-tuning data using serialization and knowledge injection techniques tailored to DP. As a result, the instruction-tuned LLMs let users manually craft instructions for DP. They can also run on a single, local, low-priced GPU, which ensures data security and enables further tuning. Our experiments show that our dataset constructed for DP instruction tuning, namely Jellyfish, effectively enhances LLMs' DP performance and barely compromises their abilities in NLP tasks. After tuning Mistral-7B and OpenOrca-Platypus2-13B with Jellyfish, the models are competitive with state-of-the-art DP methods and generalize strongly to unseen tasks. The models' performance rivals that of GPT series models, and their interpretations offer enhanced reasoning capabilities compared to GPT-3.5. The 7B and 13B Jellyfish models are available at Hugging Face: https://huggingface.co/NECOUDBFM/Jellyfish-7B and https://huggingface.co/NECOUDBFM/Jellyfish-13B
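To make the serialization and knowledge injection steps mentioned in the abstract more concrete, the following is a minimal sketch of how a pair of tabular records might be turned into an instruction prompt. The function names (`serialize_record`, `build_prompt`), the bracketed `attribute: "value"` layout, the `nan` placeholder, the prompt wording, and the choice of entity matching as the example task are all assumptions made for illustration; the abstract does not specify the actual Jellyfish prompt format.

```python
from typing import Iterable, Mapping, Optional


def serialize_record(record: Mapping[str, Optional[object]]) -> str:
    """Flatten one table row into a single text line.

    Assumed layout (hypothetical, not taken from the paper):
    [attr1: "value1", attr2: "value2"], with missing values written as nan.
    """
    parts = []
    for attr, value in record.items():
        text = "nan" if value is None else str(value)
        parts.append(f'{attr}: "{text}"')
    return "[" + ", ".join(parts) + "]"


def build_prompt(
    left: Mapping[str, Optional[object]],
    right: Mapping[str, Optional[object]],
    hints: Iterable[str] = (),
) -> str:
    """Assemble an instruction for an illustrative entity matching task.

    `hints` stands in for injected domain knowledge: short rules that
    tell the model how to treat particular attributes or values.
    """
    lines = ["Decide whether the two records refer to the same entity."]
    # Knowledge injection (assumed form): one "Note:" line per hint.
    for hint in hints:
        lines.append(f"Note: {hint}")
    lines.append(f"Record A: {serialize_record(left)}")
    lines.append(f"Record B: {serialize_record(right)}")
    lines.append('Answer with "Yes" or "No".')
    return "\n".join(lines)


if __name__ == "__main__":
    a = {"name": "iPhone 13 128GB", "price": 799}
    b = {"name": "Apple iPhone 13 (128 GB)", "price": None}
    print(build_prompt(a, b, hints=["A missing price should not count as a mismatch."]))
```

Because the prompt is plain text, a user could edit the task sentence or the hints by hand, which is the kind of manual instruction crafting the abstract describes.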