This paper explores the use of LLMs for data preprocessing (DP), a crucial step in the data mining pipeline that transforms raw data into a clean format amenable to downstream processing. While the use of LLMs has sparked interest in devising universal solutions to DP, recent initiatives in this domain typically rely on GPT APIs, raising inevitable data breach concerns. Unlike these approaches, we consider instruction-tuning local LLMs (7--13B models) as universal DP task solvers that operate on a single, local, low-priced GPU, ensuring data security and enabling further customization. We select a collection of datasets across four representative DP tasks and construct instruction-tuning data using data configuration, knowledge injection, and reasoning data distillation techniques tailored to DP. By tuning Mistral-7B, Llama 3-8B, and OpenOrca-Platypus2-13B, our models, namely Jellyfish-7B/8B/13B, deliver competitive performance compared with GPT-3.5/4 models and strong generalizability to unseen tasks, while barely compromising the base models' abilities on NLP tasks. Meanwhile, Jellyfish offers enhanced reasoning capabilities compared with GPT-3.5. Our models are available at https://huggingface.co/NECOUDBFM/Jellyfish. Our instruction dataset is available at https://huggingface.co/datasets/NECOUDBFM/Jellyfish-Instruct.