Schema matching is a crucial task in data integration: it aligns a source schema with a target schema to establish correspondences between their elements. The task is challenging because of textual and semantic heterogeneity, as well as differences in schema size. Although numerous studies have explored machine-learning-based solutions, these often suffer from low accuracy, require manually mapped schemas for model training, or need access to source schema data, which may be unavailable due to privacy concerns. In this paper, we present a novel method, named ReMatch, for matching schemas using retrieval-enhanced Large Language Models (LLMs). Our method requires no predefined mappings, no model training, and no access to data in the source database. Our experimental results on large real-world schemas demonstrate that ReMatch is an effective matcher. By eliminating the requirement for training data, ReMatch becomes a viable solution for real-world scenarios.