Contextual Graph Embeddings: Accounting for Data Characteristics in Heterogeneous Data Integration

As organizations gain access to increasingly diverse datasets, the demand for effective data integration has grown. Key tasks in this process, such as schema matching and entity resolution, are essential but often require significant manual effort. Although previous studies have aimed to automate these tasks, the influence of dataset characteristics on matching effectiveness has not been thoroughly examined, and few approaches combine different methods. This study introduces a contextual graph embedding technique that integrates structural details from tabular data with contextual elements such as column descriptions and external knowledge. Tests on datasets with varying properties, including domain specificity, data size, missing rate, and overlap rate, showed that our approach consistently surpassed existing graph-based methods, especially in difficult scenarios such as those with a high proportion of numerical values or significant missing data. However, we identified specific failure cases, such as columns that are semantically similar but distinct, which remain a challenge for our method. The study highlights two main insights: (i) contextual embeddings enhance matching reliability, and (ii) dataset characteristics significantly affect integration outcomes. These contributions can advance the development of practical data integration systems that support real-world enterprise applications.
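To make the dataset properties named above concrete, the following is a minimal sketch of how missing rate, overlap rate, and the proportion of numerical values might be measured on small in-memory tables. The abstract does not define these quantities, so the definitions here are assumptions: missing rate as the fraction of empty cells, overlap rate as the Jaccard similarity of the distinct values in two columns, and numeric fraction as the share of non-missing cells holding numbers. The function names and the column-dictionary table layout are hypothetical and are not taken from the paper.

```python
from typing import Dict, Hashable, Optional, Sequence

# Assumption: a table is a dict mapping column name -> list of cells,
# and None marks a missing cell.
Cell = Optional[Hashable]
Table = Dict[str, Sequence[Cell]]


def missing_rate(table: Table) -> float:
    """Fraction of all cells in the table that are missing (assumed definition)."""
    total = sum(len(cells) for cells in table.values())
    if total == 0:
        return 0.0
    missing = sum(1 for cells in table.values() for c in cells if c is None)
    return missing / total


def overlap_rate(a: Sequence[Cell], b: Sequence[Cell]) -> float:
    """Jaccard similarity of the distinct non-missing values of two columns
    (assumed definition; the paper may measure overlap differently)."""
    set_a = {x for x in a if x is not None}
    set_b = {x for x in b if x is not None}
    union = set_a | set_b
    if not union:
        return 0.0
    return len(set_a & set_b) / len(union)


def numeric_fraction(table: Table) -> float:
    """Share of non-missing cells that hold an int or float (bools excluded)."""
    present = [c for cells in table.values() for c in cells if c is not None]
    if not present:
        return 0.0
    numeric = sum(
        1 for c in present
        if isinstance(c, (int, float)) and not isinstance(c, bool)
    )
    return numeric / len(present)


# Hypothetical toy data, for illustration only.
customers = {"name": ["ann", None, "cho"], "age": [31, 45, None]}
clients = {"full_name": ["cho", "dev"], "years": [45, 52]}

table_missing = missing_rate(customers)
name_overlap = overlap_rate(customers["name"], clients["full_name"])
table_numeric = numeric_fraction(customers)
```

Statistics of this kind could be used to group benchmark datasets by difficulty before comparing matching methods, which is the kind of breakdown the abstract describes.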

