Large Language Models as Data Preprocessors

Large Language Models (LLMs), typified by OpenAI's GPT series and Meta's LLaMA variants, have marked a significant advancement in artificial intelligence. Trained on vast amounts of text data, LLMs are capable of understanding and generating human-like text across a diverse range of topics. This study expands on the applications of LLMs, exploring their potential in data preprocessing, a critical stage in data mining and analytics applications. We delve into the applicability of state-of-the-art LLMs such as GPT-3.5, GPT-4, and Vicuna-13B for error detection, data imputation, schema matching, and entity matching tasks. Alongside showcasing the inherent capabilities of LLMs, we highlight their limitations, particularly in terms of computational expense and inefficiency. We propose an LLM-based framework for data preprocessing, which integrates cutting-edge prompt engineering techniques, coupled with traditional methods like contextualization and feature selection, to improve the performance and efficiency of these models. The effectiveness of LLMs in data preprocessing is evaluated through an experimental study spanning 12 datasets. GPT-4 emerged as a standout, achieving 100% accuracy or F1 score on 4 datasets, suggesting LLMs' immense potential in these tasks. Despite certain limitations, our study underscores the promise of LLMs in this domain and anticipates future developments to overcome current hurdles.
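To make the combination of contextualization, feature selection, and prompting more concrete, the following is a minimal sketch for the entity matching task. It is an illustration under stated assumptions, not the framework described in the abstract: the function names `contextualize` and `entity_matching_prompt`, the prompt wording, and the sample records are all hypothetical. The sketch only builds the prompt text; sending it to a model such as GPT-4 is left out.

```python
from typing import Mapping, Optional, Sequence


def contextualize(record: Mapping[str, object],
                  features: Optional[Sequence[str]] = None) -> str:
    """Serialize a record as 'attribute: value' pairs.

    If `features` is given, only those attributes are kept (a simple form
    of feature selection that also shortens the prompt).
    """
    keys = list(features) if features is not None else list(record.keys())
    parts = []
    for key in keys:
        value = record.get(key)
        # Render missing values explicitly so the gap is visible in the text.
        text = "N/A" if value is None or value == "" else str(value)
        parts.append(f"{key}: {text}")
    return "; ".join(parts)


def entity_matching_prompt(left: Mapping[str, object],
                           right: Mapping[str, object],
                           features: Optional[Sequence[str]] = None) -> str:
    """Build a zero-shot prompt asking whether two records match.

    The instruction wording is an assumption made for this sketch.
    """
    lines = [
        "You are a data preprocessing assistant.",
        "Decide whether the two records refer to the same real-world entity.",
        "Answer with exactly one word: yes or no.",
        "",
        f"Record A: {contextualize(left, features)}",
        f"Record B: {contextualize(right, features)}",
        "Answer:",
    ]
    return "\n".join(lines)


# Hypothetical sample records, invented for illustration only.
a = {"title": "iPhone 13 128GB", "brand": "Apple", "price": 799, "sku": "A-001"}
b = {"title": "Apple iPhone 13 (128 GB)", "brand": "Apple", "price": None, "sku": "B-917"}

# Keep only attributes likely to matter for matching; "sku" is dropped.
prompt = entity_matching_prompt(a, b, features=["title", "brand", "price"])
print(prompt)
```

Dropping uninformative attributes before serialization shortens each prompt, which is one plausible way to reduce the computational expense the abstract highlights. The same serialization step could be reused for error detection, data imputation, or schema matching by changing only the instruction lines.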

