Column matching is a central task in reconciling schemas for data integration. Column names and descriptions are valuable for this task, and LLMs can leverage such natural-language schema metadata. In many datasets, however, correct matching requires evidence beyond the column itself. Because it is impractical to provide an LLM with all of the schema metadata needed to capture this evidence, the core challenge is to select and organize the most useful contextual information. We present ConStruM, a structure-guided framework for budgeted evidence packing in schema matching. ConStruM constructs a lightweight, reusable structure from which, at query time, it assembles a small context pack emphasizing the most discriminative evidence. ConStruM is designed as an add-on: given a shortlist of candidate targets produced by an upstream matcher, it augments the matcher's final LLM prompt with structured, query-specific evidence so that the final selection is better grounded. To this end, we develop a context tree for budgeted multi-level context retrieval and a global similarity hypergraph that surfaces groups of highly similar columns (on both the source and target sides), summarized via group-aware differentiation cues that are computed online or precomputed offline. Experiments on real datasets show that ConStruM improves matching by providing and organizing the right contextual evidence.
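The idea of budgeted evidence packing can be illustrated with a minimal sketch. The abstract does not specify how ConStruM selects evidence, so everything below is an assumption made for illustration: the `Evidence` type, its `score` and `cost` fields, and the greedy score-per-token rule in `pack_evidence` are hypothetical and are not the authors' method.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Evidence:
    """One candidate piece of context (hypothetical type, not from the paper)."""
    text: str
    score: float  # assumed: how discriminative this evidence is for the query
    cost: int     # assumed: prompt tokens this evidence would consume


def pack_evidence(items, budget):
    """Greedily select evidence by score per token until the budget is spent.

    This is a simple knapsack-style heuristic chosen for illustration only.
    """
    chosen = []
    remaining = budget
    # Rank by value density; max(cost, 1) guards against division by zero.
    ranked = sorted(items, key=lambda e: e.score / max(e.cost, 1), reverse=True)
    for item in ranked:
        if item.cost <= remaining:
            chosen.append(item)
            remaining -= item.cost
    return chosen


# Illustrative inputs; the texts, scores, and costs are made up.
candidates = [
    Evidence("sibling columns of the target", 0.9, 30),
    Evidence("differentiation cue for a similar-column group", 0.5, 10),
    Evidence("table-level description", 0.2, 40),
]
pack = pack_evidence(candidates, budget=45)
```

With a budget of 45 tokens, the sketch keeps the two densest items and skips the table-level description, which no longer fits. A real system would derive the scores from the context tree and the similarity groups described above rather than from fixed numbers.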