Large Language Models (LLMs), typified by OpenAI's GPT series and Meta's LLaMA variants, mark a significant advance in artificial intelligence. Trained on vast amounts of text, LLMs can understand and generate human-like text across a diverse range of topics. This study explores their potential in data preprocessing, a critical stage in data mining and analytics applications. We examine the applicability of state-of-the-art LLMs such as GPT-3.5, GPT-4, and Vicuna-13B to four tasks: error detection, data imputation, schema matching, and entity matching. Alongside showcasing the inherent capabilities of LLMs, we highlight their limitations, particularly their computational expense and inefficiency. We propose an LLM-based framework for data preprocessing that integrates cutting-edge prompt engineering techniques with traditional methods such as contextualization and feature selection to improve the performance and efficiency of these models. We evaluate the effectiveness of LLMs in data preprocessing through an experimental study spanning 12 datasets. GPT-4 stands out, achieving 100% accuracy or F1 score on 4 datasets, which suggests that LLMs have substantial potential for these tasks. Despite these limitations, our study underscores the promise of LLMs in this domain and anticipates that future developments will overcome current hurdles.
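As a rough illustration of how contextualization and feature selection might feed a prompt for one of these tasks (entity matching), consider the minimal sketch below. It is not the framework's actual implementation: the function names, the serialization format, the prompt wording, and the example records are all assumptions made for illustration, and no model is called.

```python
def serialize_record(record, selected_attributes):
    """Contextualize a record as text, keeping only the selected attributes.

    The `name: "value"` format is an assumed serialization, not one
    prescribed by the study.
    """
    parts = [
        f'{name}: "{record[name]}"'
        for name in selected_attributes
        if name in record
    ]
    return "[" + ", ".join(parts) + "]"


def build_entity_matching_prompt(record_a, record_b, selected_attributes):
    """Assemble a zero-shot prompt asking whether two records match.

    The instruction wording is hypothetical.
    """
    return (
        "You are given two records. Decide whether they refer to the same "
        "real-world entity. Answer with 'yes' or 'no'.\n"
        f"Record A: {serialize_record(record_a, selected_attributes)}\n"
        f"Record B: {serialize_record(record_b, selected_attributes)}\n"
        "Answer:"
    )


# Hypothetical records; "sku" is dropped by feature selection below.
record_a = {"title": "iPhone 13 128GB", "brand": "Apple", "sku": "A-001"}
record_b = {"title": "Apple iPhone 13 (128 GB)", "brand": "Apple", "sku": "ZX-9"}

prompt = build_entity_matching_prompt(record_a, record_b, ["title", "brand"])
print(prompt)
```

Restricting the prompt to a few informative attributes keeps it shorter, which is one plausible way such a framework could reduce the token cost that the abstract identifies as a limitation.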