Many organizations rely on data from government and third-party sources, but these sources and organizations do not share a common data format, which makes it challenging to integrate data from multiple sources. Commercial database systems do not offer adequate support for integrating data from heterogeneous sources, and manual integration is both time-consuming and inefficient. State-of-the-art approaches rely on similarity functions and textual transformations, but they often fail on challenging cases where multiple mappings are required or where the mappings go beyond simple textual transformations. In this paper, we study the potential of deep neural models for transforming tables for joinability. In particular, we cast the problem as a prediction task and develop a framework that leverages large deep-learning language models to transform tabular data from a source format to a desired target representation. Our framework can efficiently learn the pattern that maps the source format to the expected target from just a few examples, and the learned mapping can then be used for table joining, filling in missing values, and error detection. Compared to state-of-the-art mapping and joining approaches, our framework is noticeably more accurate and more scalable on both real-world and synthetic datasets. Our experimental evaluation also shows that the proposed framework with our fine-tuned model performs on par with or better than large language models such as GPT-3, despite the significant difference in size, and that integrating large language models into our framework improves their performance.
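As a rough illustration of the transform-then-join idea summarized above, the following Python sketch serializes a few source-to-target example pairs into a prompt, asks a model to predict the target form of each source key, and then performs an ordinary equality join on the predicted values. This is not the paper's framework: the prompt layout, the function names (`build_prompt`, `toy_model`, `transform_and_join`), and the sample data are all assumptions made for illustration, and `toy_model` is a hard-coded stand-in that would be replaced by a fine-tuned or large language model in practice.

```python
from typing import Callable, Dict, List, Tuple

Example = Tuple[str, str]  # (source_value, target_value); hypothetical type alias
Row = Dict[str, str]


def build_prompt(examples: List[Example], query: str) -> str:
    """Serialize few-shot examples plus one query value into a text prompt.

    The "source: ... -> target: ..." layout is an assumed format, not the
    serialization used in the paper.
    """
    lines = [f"source: {s} -> target: {t}" for s, t in examples]
    lines.append(f"source: {query} -> target:")
    return "\n".join(lines)


def toy_model(prompt: str) -> str:
    """Stand-in for a language model (assumption, for demonstration only).

    It ignores the examples and applies one fixed rule,
    "Last, First" -> "f.last", to the query on the last prompt line.
    A real model would infer the mapping from the examples.
    """
    query_line = prompt.splitlines()[-1]
    value = query_line[len("source: "):-len(" -> target:")]
    last, first = [part.strip() for part in value.split(",")]
    return f"{first[0]}.{last}".lower()


def transform_and_join(
    source_rows: List[Row],
    target_rows: List[Row],
    source_key: str,
    target_key: str,
    examples: List[Example],
    model: Callable[[str], str],
) -> List[Row]:
    """Predict the target form of each source key, then equi-join on it."""
    # Index the target table by its join column for constant-time lookup.
    index: Dict[str, List[Row]] = {}
    for row in target_rows:
        index.setdefault(row[target_key], []).append(row)

    joined: List[Row] = []
    for row in source_rows:
        predicted = model(build_prompt(examples, row[source_key]))
        for match in index.get(predicted, []):
            joined.append({**row, **match})
    return joined


# Made-up sample data: two tables that refer to people in different formats.
source_rows = [
    {"name": "Smith, John", "dept": "Sales"},
    {"name": "Doe, Jane", "dept": "HR"},
]
target_rows = [
    {"login": "j.smith", "email": "j.smith@example.org"},
    {"login": "a.kim", "email": "a.kim@example.org"},
]
examples = [("Lee, Ann", "a.lee"), ("Brown, Bob", "b.brown")]

result = transform_and_join(
    source_rows, target_rows, "name", "login", examples, toy_model
)
print(result)
```

With this sample data only the "Smith, John" row finds a partner, since no target row has the login `j.doe`. Passing the model in as a callable keeps the join logic independent of which model produces the predictions, so the stand-in can be swapped for a real model without touching `transform_and_join`.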