Jellyfish: A Large Language Model for Data Preprocessing

This paper explores the use of LLMs for data preprocessing (DP), a crucial step in the data mining pipeline that transforms raw data into a clean format conducive to easy processing. Although LLMs have sparked interest in devising universal solutions to DP, recent initiatives in this domain typically rely on GPT APIs, which raises unavoidable concerns about data breaches. Unlike these approaches, we consider instruction-tuning local LLMs (7-13B models) as universal DP task solvers. We select a collection of datasets across four representative DP tasks and construct instruction-tuning data using serialization and knowledge injection techniques tailored to DP. The resulting instruction-tuned LLMs let users manually craft instructions for DP. They can also run on a single, local, low-priced GPU, which ensures data security and enables further tuning. Our experiments show that our dataset constructed for DP instruction tuning, namely Jellyfish, effectively enhances LLMs' DP performance while barely compromising their abilities in NLP tasks. By tuning Mistral-7B and OpenOrca-Platypus2-13B with Jellyfish, we obtain models that are competitive with state-of-the-art DP methods and generalize strongly to unseen tasks. The models' performance rivals that of GPT series models, and their interpretations offer enhanced reasoning capabilities compared to GPT-3.5. The 7B and 13B Jellyfish models are available at Hugging Face: https://huggingface.co/NECOUDBFM/Jellyfish-7B and https://huggingface.co/NECOUDBFM/Jellyfish-13B
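
To make the serialization and knowledge injection step more concrete, the following is a minimal sketch of how a pair of table rows might be turned into an instruction prompt. Everything here is an assumption for illustration: the abstract does not give the prompt template, so the function names `serialize_record` and `build_matching_prompt`, the `attr: value` layout, the `N/A` marker for missing values, the optional `hint` line standing in for injected knowledge, and the choice of a record-matching task are all hypothetical rather than the format actually used to build Jellyfish.

```python
from typing import Mapping, Optional


def serialize_record(record: Mapping[str, Optional[str]]) -> str:
    """Flatten one table row into 'attr: value' pairs (hypothetical layout)."""
    parts = []
    for attr, value in record.items():
        # Missing or blank cells are written as an explicit marker so the
        # model sees that the attribute exists but has no value.
        if value is None or str(value).strip() == "":
            text = "N/A"
        else:
            text = str(value).strip()
        parts.append(f"{attr}: {text}")
    return "[" + ", ".join(parts) + "]"


def build_matching_prompt(
    left: Mapping[str, Optional[str]],
    right: Mapping[str, Optional[str]],
    hint: str = "",
) -> str:
    """Assemble an instruction prompt asking whether two rows match."""
    lines = [
        "You are given two records. "
        "Decide whether they refer to the same real-world entity."
    ]
    # The optional hint is a stand-in for task-specific knowledge injected
    # into the instruction (assumed form, not the paper's wording).
    if hint:
        lines.append(f"Hint: {hint}")
    lines.append(f"Record A: {serialize_record(left)}")
    lines.append(f"Record B: {serialize_record(right)}")
    lines.append('Answer with "Yes" or "No".')
    return "\n".join(lines)


if __name__ == "__main__":
    a = {"name": "iPhone 12 64GB", "brand": "Apple", "price": "699"}
    b = {"name": "Apple iPhone 12 (64 GB)", "brand": None, "price": "699.00"}
    print(build_matching_prompt(a, b, hint="Missing values are not evidence of a mismatch."))
```

A prompt built this way could be paired with a gold answer to form one instruction-tuning example, or sent as-is to a locally hosted model at inference time.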

