Heterogeneous Entity Matching with Complex Attribute Associations using BERT and Neural Networks

Across various domains, data from different sources such as Baidu Baike and Wikipedia often takes distinct forms. Current entity matching methods focus predominantly on homogeneous data, in which records share the same attribute structure and attribute values are concise, so they struggle with data in diverse formats. Moreover, prevailing approaches determine entity similarity by aggregating the similarities of values between corresponding attributes. They often overlook the intricate relationships between attributes, where one attribute may be associated with several others, so simple pairwise attribute comparison fails to exploit the wealth of information contained in the entities. To address these challenges, we introduce a novel entity matching model built upon pre-trained models, the Entity Matching Model for Capturing Complex Attribute Relationships (EMM-CCAR). Specifically, the model transforms the matching task into a sequence matching problem to mitigate the impact of varying data formats. In addition, it uses attention mechanisms to identify complex relationships between attributes, emphasizing the degree of matching among multiple attributes rather than one-to-one correspondences. In this way, EMM-CCAR addresses both data heterogeneity and intricate attribute interdependencies. Compared with the prevalent DER-SSM and Ditto approaches, our model improves F1 scores by approximately 4% and 1%, respectively, offering a robust solution to attribute complexity in entity matching.
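
To make the two ideas in the abstract concrete, the following is a minimal sketch, not the EMM-CCAR implementation. It assumes each entity is a flat dictionary of attributes, serializes both records into one sequence pair using `[COL]`/`[VAL]`/`[SEP]` markers (an assumed convention in the style of Ditto-like serialization), and replaces learned attention with a hand-written stand-in: a softmax over token-overlap scores, so that each attribute on one side receives weights over all attributes on the other side instead of a single one-to-one partner. The records, function names, and the `temperature` parameter are hypothetical, and the right-hand record is assumed to be non-empty.

```python
import math


def serialize(entity):
    """Flatten an attribute dict into one string, marking names and values."""
    return " ".join(f"[COL] {name} [VAL] {value}" for name, value in entity.items())


def jaccard(a, b):
    """Token-overlap similarity between two strings, in [0, 1]."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0


def soft_alignment(left, right, temperature=0.1):
    """For each left attribute, softmax-normalised weights over all right attributes.

    This is a toy stand-in for attention: the scores come from token overlap,
    not from a trained model.
    """
    weights = {}
    for lname, lval in left.items():
        scores = [jaccard(str(lval), str(rval)) / temperature for rval in right.values()]
        peak = max(scores)  # subtract the maximum for numerical stability
        exps = [math.exp(s - peak) for s in scores]
        total = sum(exps)
        weights[lname] = {rname: e / total for rname, e in zip(right, exps)}
    return weights


# Hypothetical records with different schemas describing the same entity.
left = {"name": "Alan Turing", "born": "23 June 1912 London"}
right = {
    "title": "Alan Turing",
    "summary": "British mathematician born in London in 1912",
    "field": "computer science",
}

# Sequence-pair input of the kind a pre-trained encoder could consume.
pair = serialize(left) + " [SEP] " + serialize(right)

# Many-to-many attribute weights instead of fixed one-to-one pairs.
alignment = soft_alignment(left, right)
```

In this toy case the `born` attribute has no same-named counterpart, yet most of its weight falls on `summary`, which is the kind of cross-attribute association that strict pairwise comparison would miss.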
