Pivot tables are ubiquitous in the data lakes of modern data ecosystems, making accurate schema matching over pivot tables a key prerequisite for data integration. In this paper, we focus on pivot table schema matching, a novel joint schema-value matching task that aligns schemas between pivot tables and standard relational tables: a correct match must be semantically consistent at the schema level and compatible at the value level. However, the task is inherently data-sensitive, and the prevalence of anonymized data in practice poses significant challenges to matching accuracy and generalization. To tackle these challenges, we propose PiLLar, the first framework for pivot table schema matching. We first formulate PiLLar as an LLM-driven search paradigm that operates with minimal annotated, privacy-compliant data, thereby achieving training-free adaptation across diverse domains. Next, we provide a theoretical analysis of the error dynamics of the paradigm to establish the asymptotic convergence of the proposed method. Furthermore, we introduce a new benchmark, PTbench, derived from four representative real-world domains and constructed by mining unpivot-suitable tables, performing unpivot on semantically coherent attributes, and applying sampling and anonymization. Extensive experiments demonstrate the superiority of PiLLar, which achieves an average accuracy of 87.94% on the correctly predicted matches.
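To make the unpivot operation mentioned above concrete, the following is a minimal sketch in plain Python. It is not PiLLar's matching procedure or the PTbench construction code; the `unpivot` helper, the toy table, and the names `country`, `year`, and `sales` are all hypothetical and chosen only for illustration. It shows how the column headers of a pivot table become values of an attribute in a relational layout, which is one way to see why a match has to hold at both the schema level and the value level.

```python
from typing import Dict, List

Row = Dict[str, object]


def unpivot(
    rows: List[Row],
    id_cols: List[str],
    value_cols: List[str],
    var_name: str,
    value_name: str,
) -> List[Row]:
    """Turn each wide (pivot) row into one long row per column in value_cols."""
    long_rows: List[Row] = []
    for row in rows:
        for col in value_cols:
            # Copy the identifying columns unchanged.
            long_row: Row = {key: row[key] for key in id_cols}
            # The former column header becomes a value of `var_name`.
            long_row[var_name] = col
            # The former cell becomes a value of `value_name`.
            long_row[value_name] = row[col]
            long_rows.append(long_row)
    return long_rows


# Hypothetical pivot table: one column per year (illustrative data only).
pivot_rows = [
    {"country": "A", "2020": 10, "2021": 12},
    {"country": "B", "2020": 7, "2021": 9},
]

long_rows = unpivot(pivot_rows, ["country"], ["2020", "2021"], "year", "sales")
for r in long_rows:
    print(r)
```

In this toy case, the pivot headers `2020` and `2021` correspond to values of a relational `year` attribute, while the cells correspond to values of `sales`. A matcher that compares attribute names alone would have nothing to align those headers with.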