DTT: An Example-Driven Tabular Transformer for Joinability by Leveraging Large Language Models

Many organizations rely on data from government and third-party sources, and these sources rarely share a common data format. This makes it difficult to integrate data from multiple sources or to align external sources with internal databases. Commercial database systems do not offer adequate support for integrating data from heterogeneous sources, and manual integration is both time-consuming and inefficient. State-of-the-art data integration approaches that rely on similarity functions and textual transformations often fail on challenging cases where multiple mappings are required or where the mappings go beyond simple textual transformations. In this paper, we study the potential of deep neural models for transforming tables for joinability. In particular, we cast the problem as a prediction task and develop a framework that leverages large deep-learning language models to transform tabular data from a source format to a desired target representation. Our framework can efficiently learn the patterns for mapping a source format into an expected target from just a few examples, and the learned mapping can then be used for tasks such as table joining, filling in missing values, and error detection. Compared to state-of-the-art mapping and joining approaches, our framework delivers noticeably higher accuracy and better scalability on both real-world and synthetic datasets. Our experimental evaluation also shows that the proposed framework with our fine-tuned model performs on par with or better than large language models such as GPT-3, despite the significant difference in size, and that using large language models within our framework improves their performance.
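
To make the idea of example-driven transformation for joinability concrete, the following is a minimal sketch and not the framework's actual implementation. The prompt layout (`source => target` lines), the function names `build_prompt` and `transform_and_join`, and the rule-based `toy_predict` stand-in for a language model are all assumptions introduced for illustration; in practice `predict` would wrap a fine-tuned sequence-to-sequence model or a large language model.

```python
from typing import Callable, List, Tuple


def build_prompt(examples: List[Tuple[str, str]], source_value: str) -> str:
    """Serialize a few (source, target) example pairs plus one query value.

    The 'source => target' layout is an assumed format for illustration only.
    """
    lines = [f"{src} => {tgt}" for src, tgt in examples]
    lines.append(f"{source_value} => ")
    return "\n".join(lines)


def transform_and_join(
    source_rows: List[str],
    target_keys: List[str],
    examples: List[Tuple[str, str]],
    predict: Callable[[str], str],
) -> List[Tuple[str, str]]:
    """Predict the target form of each source value, then join on exact match."""
    target_set = set(target_keys)
    joined = []
    for row in source_rows:
        predicted = predict(build_prompt(examples, row))
        if predicted in target_set:
            joined.append((row, predicted))
    return joined


def toy_predict(prompt: str) -> str:
    """Hypothetical stand-in for a language model.

    It only handles the 'First Last' -> 'Last, F.' pattern and ignores the
    examples in the prompt; a real model would infer the pattern from them.
    """
    query = prompt.splitlines()[-1].rsplit(" => ", 1)[0]
    parts = query.split()
    if len(parts) < 2:
        return query  # No recognizable pattern: return the value unchanged.
    return f"{parts[-1]}, {parts[0][0]}."


examples = [("Alan Turing", "Turing, A.")]
source_rows = ["Ada Lovelace", "Grace Hopper", "Unknown"]
target_keys = ["Lovelace, A.", "Hopper, G.", "Knuth, D."]

matches = transform_and_join(source_rows, target_keys, examples, toy_predict)
for src, tgt in matches:
    print(f"{src} joins with {tgt}")
```

Keeping `predict` as a plain callable means the same join logic can be reused with different models, which mirrors the abstract's point that the framework can host either a smaller fine-tuned model or a large language model.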
