Entity Matching (EM) involves identifying different data representations that refer to the same entity across multiple data sources, and is typically formulated as a binary classification problem. It is a challenging problem in data integration because of the heterogeneity of data representations. State-of-the-art solutions have adopted NLP techniques based on pre-trained language models (PrLMs) via the fine-tuning paradigm; however, sequential fine-tuning of overparameterized PrLMs can lead to catastrophic forgetting, especially in low-resource scenarios. In this study, we propose a parameter-efficient paradigm for fine-tuning PrLMs based on adapters, small neural networks encapsulated between layers of a PrLM: only the adapter and classifier weights are optimized, while the PrLM's parameters remain frozen. Adapter-based methods have been successfully applied to multilingual speech problems with promising results; however, their effectiveness when applied to EM is not yet well understood, particularly for generalized EM with heterogeneous data. Furthermore, we explore using (i) pre-trained adapters and (ii) invertible adapters to capture token-level language representations, and demonstrate their benefits for transfer learning on the generalized EM benchmark. Our results show that our solution achieves comparable or superior performance to full-scale PrLM fine-tuning and prompt-tuning baselines while using a significantly smaller computational footprint ($\approx 13\%$ of the PrLM parameters).
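The adapter idea described above can be illustrated with a minimal sketch. It is not the authors' implementation: the bottleneck structure (down-projection, ReLU, up-projection, residual connection), the zero-initialised up-projection, the omission of bias terms, and every name and size below (`BottleneckAdapter`, `trainable_fraction`, the hidden and bottleneck dimensions) are assumptions chosen for illustration. It uses plain Python lists so that it runs without any deep learning library.

```python
import random


def make_matrix(rows, cols, rng, scale=0.02):
    """Return a rows x cols matrix of small Gaussian weights."""
    return [[rng.gauss(0.0, scale) for _ in range(cols)] for _ in range(rows)]


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


class BottleneckAdapter:
    """Hypothetical adapter: down-project, ReLU, up-project, add residual.

    Bias terms are omitted to keep the sketch short (an assumption).
    """

    def __init__(self, hidden, bottleneck, rng):
        self.down = make_matrix(bottleneck, hidden, rng)
        # Assumption: a zero up-projection makes the adapter start as the
        # identity map, so the frozen model's output is initially unchanged.
        self.up = [[0.0] * bottleneck for _ in range(hidden)]

    def __call__(self, h):
        z = [max(0.0, a) for a in matvec(self.down, h)]
        return [x + d for x, d in zip(h, matvec(self.up, z))]

    def num_params(self):
        return sum(len(r) for r in self.down) + sum(len(r) for r in self.up)


def trainable_fraction(frozen_params, adapter_params, classifier_params):
    """Share of all parameters that are updated when the base model is frozen."""
    trainable = adapter_params + classifier_params
    return trainable / (frozen_params + trainable)


if __name__ == "__main__":
    adapter = BottleneckAdapter(hidden=8, bottleneck=2, rng=random.Random(0))
    print(adapter([1.0] * 8))    # identical to the input at initialisation
    print(adapter.num_params())  # 2*8 + 8*2 weights
```

Only the adapter's `down` and `up` matrices (plus a classifier head, not shown) would be updated during training, while the surrounding layer weights stay fixed. The sizes here are toy values and are not meant to reproduce the parameter share reported in the abstract.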