Traditional methods for identifying structurally similar spreadsheets fail to capture the spatial layouts and type patterns that define templates. To quantify spreadsheet similarity, we introduce a hybrid distance metric that combines semantic embeddings, data type information, and spatial positioning. Our method converts each spreadsheet into a set of cell-level embeddings and then aggregates them with set-level distances such as Chamfer and Hausdorff. Experiments across template families show stronger unsupervised clustering performance than the graph-based Mondrian baseline, with perfect template reconstruction on the FUSTE dataset (Adjusted Rand Index of 1.00 versus 0.90). The approach supports large-scale automated template discovery, which in turn enables downstream applications such as retrieval-augmented generation over tabular collections, model training, and bulk data cleaning.
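The aggregation step can be illustrated with a minimal sketch. It assumes each cell is represented by a vector that concatenates a semantic embedding, a one-hot data type, and its (row, column) position. The helper names (`cell_vector`, `chamfer_distance`, `hausdorff_distance`), the weights, and the Euclidean ground distance are all hypothetical choices made for illustration; the abstract does not specify how the three components are combined.

```python
import numpy as np

def cell_vector(semantic, type_onehot, row, col,
                w_sem=1.0, w_type=1.0, w_pos=1.0):
    """Hypothetical hybrid cell representation: weighted concatenation of
    a semantic embedding, a one-hot data type, and the cell position."""
    return np.concatenate([
        w_sem * np.asarray(semantic, dtype=float),
        w_type * np.asarray(type_onehot, dtype=float),
        w_pos * np.array([row, col], dtype=float),
    ])

def pairwise_distances(a, b):
    """Euclidean distance between every row of a and every row of b."""
    diff = a[:, None, :] - b[None, :, :]
    return np.linalg.norm(diff, axis=-1)

def chamfer_distance(a, b):
    """Symmetric Chamfer distance: mean nearest-neighbour distance
    from a to b plus the same from b to a."""
    d = pairwise_distances(a, b)
    return d.min(axis=1).mean() + d.min(axis=0).mean()

def hausdorff_distance(a, b):
    """Symmetric Hausdorff distance: the worst-case nearest-neighbour
    distance in either direction."""
    d = pairwise_distances(a, b)
    return max(d.min(axis=1).max(), d.min(axis=0).max())

# Toy sheets: two cells each, with 2-d "semantic" vectors and 2 data types.
sheet_a = np.stack([
    cell_vector([1.0, 0.0], [1, 0], row=0, col=0),
    cell_vector([0.0, 1.0], [0, 1], row=1, col=0),
])
sheet_b = sheet_a.copy()  # same template, so both distances are zero
print(chamfer_distance(sheet_a, sheet_b), hausdorff_distance(sheet_a, sheet_b))
```

Chamfer averages the nearest-neighbour mismatches, so it tolerates a few stray cells, whereas Hausdorff reports the single worst mismatch and is therefore more sensitive to outlier cells. A pairwise matrix of either distance could then be passed to any standard clustering routine.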