Large Language Models (LLMs), typified by OpenAI's GPT, mark a significant advance in artificial intelligence. Trained on vast amounts of text, LLMs can understand and generate human-like text across a diverse range of topics. This study extends the applications of LLMs to data preprocessing, a critical stage in data mining and analytics pipelines. Focusing on tabular data, we examine the applicability of state-of-the-art LLMs such as GPT-4 and GPT-4o to a series of preprocessing tasks, including error detection, data imputation, schema matching, and entity matching. Alongside showcasing the inherent capabilities of LLMs, we highlight their limitations, particularly their computational expense and inefficiency. We propose an LLM-based framework for data preprocessing that integrates cutting-edge prompt engineering techniques with traditional methods such as contextualization and feature selection to improve the performance and efficiency of these models. We evaluate the effectiveness of LLMs in data preprocessing through an experimental study spanning a variety of public datasets. GPT-4 stood out, achieving a 100\% accuracy or F1 score on four of these datasets, suggesting LLMs' immense potential for these tasks. Despite certain limitations, our study underscores the promise of LLMs in this domain and anticipates future developments that will overcome current hurdles.
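To make the entity-matching setup above concrete, here is a minimal sketch of the serialize-and-prompt pattern such a framework might use for tabular records. The function names, prompt wording, and example records are illustrative assumptions, not the paper's actual implementation; the model call itself is omitted.

```python
# Illustrative sketch (not the paper's implementation): turn two tabular
# records into a natural-language entity-matching prompt for an LLM.

def serialize_record(record: dict) -> str:
    """Flatten a tabular record into 'attribute: value' text."""
    return "; ".join(f"{k}: {v}" for k, v in record.items())

def entity_matching_prompt(a: dict, b: dict) -> str:
    """Ask whether two records refer to the same real-world entity."""
    return (
        "Do the following two records refer to the same real-world entity? "
        "Answer with Yes or No.\n"
        f"Record A: {serialize_record(a)}\n"
        f"Record B: {serialize_record(b)}"
    )

# Hypothetical example records for a product-matching task.
prompt = entity_matching_prompt(
    {"title": "iPhone 13 128GB", "brand": "Apple"},
    {"title": "Apple iPhone 13 (128 GB)", "brand": "Apple"},
)
print(prompt)
```

The prompt string would then be sent to a model such as GPT-4 via the provider's chat API; constraining the answer to Yes/No keeps the response cheap to parse, which matters given the cost concerns the abstract raises.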