The integration of tabular data from diverse sources is often hindered by inconsistencies in formatting and representation, posing significant challenges for data analysts and personal digital assistants. Existing methods for automating tabular data transformations are limited in scope, often focusing on specific types of transformations or lacking interpretability. In this paper, we introduce TabulaX, a novel framework that leverages Large Language Models (LLMs) for multi-class tabular transformations. TabulaX first classifies input tables into four transformation classes (string-based, numerical, algorithmic, and general) and then applies tailored methods to generate human-interpretable transformation functions, such as numeric formulas or programming code. This approach enhances transparency and allows users to understand and modify the mappings. Through extensive experiments on real-world datasets from various domains, we demonstrate that TabulaX outperforms existing state-of-the-art approaches in terms of accuracy, supports a broader class of transformations, and generates interpretable transformations that can be efficiently applied.
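To make the four transformation classes concrete, here is a minimal Python sketch of what a human-interpretable transformation function could look like for each class. Every name in it (`TransformationClass`, `abbreviate_name`, `celsius_to_fahrenheit`, `iso_date_to_weekday`, `country_to_capital`, `apply_transformation`) is hypothetical, as is the choice of example for each class. None of it is taken from TabulaX, and the LLM-based classification and function-generation steps are not modeled.

```python
from datetime import date
from enum import Enum
from typing import Callable, Dict, List


class TransformationClass(Enum):
    """The four classes named in the abstract (labels only; hypothetical enum)."""
    STRING = "string-based"
    NUMERICAL = "numerical"
    ALGORITHMIC = "algorithmic"
    GENERAL = "general"


def abbreviate_name(value: str) -> str:
    """String-based example (assumed): 'Ada Lovelace' -> 'A. Lovelace'."""
    first, last = value.split(" ", 1)
    return f"{first[0]}. {last}"


def celsius_to_fahrenheit(value: str) -> str:
    """Numerical example (assumed): the formula y = 1.8 * x + 32."""
    return f"{1.8 * float(value) + 32:.1f}"


WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
            "Friday", "Saturday", "Sunday"]


def iso_date_to_weekday(value: str) -> str:
    """Algorithmic example (assumed): needs a procedure, not a formula."""
    return WEEKDAYS[date.fromisoformat(value).weekday()]


# General example (assumed): a mapping that relies on outside knowledge,
# represented here as a tiny illustrative lookup table.
CAPITALS: Dict[str, str] = {"France": "Paris", "Japan": "Tokyo"}


def country_to_capital(value: str) -> str:
    return CAPITALS[value]


# One readable, editable function per class.
EXAMPLE_FUNCTIONS: Dict[TransformationClass, Callable[[str], str]] = {
    TransformationClass.STRING: abbreviate_name,
    TransformationClass.NUMERICAL: celsius_to_fahrenheit,
    TransformationClass.ALGORITHMIC: iso_date_to_weekday,
    TransformationClass.GENERAL: country_to_capital,
}


def apply_transformation(column: List[str],
                         cls: TransformationClass) -> List[str]:
    """Apply the function registered for `cls` to every cell in a column."""
    fn = EXAMPLE_FUNCTIONS[cls]
    return [fn(cell) for cell in column]
```

Because each mapping is an ordinary function, a user can read it, change a constant or a rule, and reapply it to a whole column without further model calls.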