Schema matching is a fundamental step in integrating heterogeneous data sources. While Pre-trained Language Models (PLMs) have revolutionized this task by capturing linguistic semantics, they typically process tabular data as serialized text, that is, as sequences of standalone column descriptions. This serialization discards critical structural information, namely the row-level co-occurrences that form the relational context, and forces models to rely solely on column header semantics or standalone distributions. To bridge this gap, we propose SemStruct, a framework that combines the semantic power of frozen PLMs with the structural inductive bias of Graph Neural Networks (GNNs). We model the table as a heterogeneous graph in which columns and values are nodes connected by rows, allowing the GNN to propagate disambiguating context across the structure. Unlike other state-of-the-art methods, which require proprietary LLM access and fine-tuning of language models, SemStruct keeps the language model frozen and trains only a lightweight structural encoder. Extensive experiments on the Valentine and SOTAB-SM benchmarks demonstrate that SemStruct achieves state-of-the-art performance, outperforming fully fine-tuned baselines on complex, semantically joinable datasets. Furthermore, our ablation studies reveal that row representations serve primarily as topological conduits rather than semantic entities, validating the necessity of explicit structural modeling in schema matching.
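The table-as-graph idea can be illustrated with a minimal sketch. It is not the SemStruct implementation: the node naming, the choice to represent rows as explicit nodes, and the helper names `build_table_graph` and `co_occurring_values` are assumptions made here for illustration. It only shows how row-level co-occurrence, which serialization of standalone columns discards, becomes a two-hop path that a message-passing model could exploit.

```python
from collections import defaultdict


def build_table_graph(header, rows):
    """Return an undirected adjacency map over column, value, and row nodes.

    Hypothetical layout (an assumption, not taken from the paper):
      ("col", name)          one node per column
      ("val", name, cell)    one node per distinct value within a column
      ("row", index)         one node per row, linking the values it contains
    """
    adj = defaultdict(set)
    for r, row in enumerate(rows):
        row_node = ("row", r)
        for col, cell in zip(header, row):
            col_node = ("col", col)
            val_node = ("val", col, cell)  # values are scoped to their column
            # column <-> value: the value occurs in this column
            adj[col_node].add(val_node)
            adj[val_node].add(col_node)
            # value <-> row: the value occurs in this row
            adj[val_node].add(row_node)
            adj[row_node].add(val_node)
    return dict(adj)


def co_occurring_values(adj, val_node):
    """Collect values reachable in two hops through a shared row node."""
    out = set()
    for neighbor in adj[val_node]:
        if neighbor[0] == "row":
            out |= {v for v in adj[neighbor] if v[0] == "val" and v != val_node}
    return out


# Toy table: "Paris" is ambiguous without its row context.
header = ["city", "country"]
rows = [["Paris", "France"], ["Paris", "USA"], ["Lyon", "France"]]

graph = build_table_graph(header, rows)
paris_context = co_occurring_values(graph, ("val", "city", "Paris"))
```

In this toy table, `paris_context` contains the `country` values that share a row with `Paris`. The row nodes carry no content of their own and act only as connectors, which is in the spirit of the abstract's remark that rows serve as topological conduits.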