Shape-Agnostic Table Overlap Discovery: A Maximum Common Subhypergraph Approach - 专知论文

会员服务 ·

0

行 · 列 · 剪枝 · 可理解性 · 可辨认的 ·

Shape-Agnostic Table Overlap Discovery: A Maximum Common Subhypergraph Approach

翻译：暂无翻译

Ge Lee,Shixun Huang,Zhifeng Bao,Felix Naumann,Shazia Sadiq,Yanchang Zhao

from arxiv, Technical report of a paper accepted to SIGMOD 2026

Understanding how two tables overlap is useful for many data management tasks, but challenging because tables often differ in row and column orders and lack reliable metadata in practice. Prior work defines the largest rectangular overlap, which identifies the maximal contiguous region of matching cells under row and column permutations. However, real overlaps are rarely rectangular, where many valid matches may lie outside any single contiguous block. In this paper, we introduce the Shape-Agnostic Largest Table Overlap (SALTO), a novel generalized notion of overlap that captures arbitrary-shaped, non-contiguous overlaps between tables. To tackle the combinatorial complexity of row and column permutations, we propose to model each table as a hypergraph, casting SALTO computation into a maximum common subhypergraph problem. We prove their equivalence and show the problem is NP-hard to approximate. To solve it, we propose HyperSplit, a novel branch-and-bound algorithm tailored to table-induced hypergraphs. HyperSplit introduces (i) hypergraph-aware label classes that jointly encode cell values and their row-column memberships to ensure structurally valid correspondences without explicit permutation enumeration, (ii) incidence-guided refinement and upper-bound pruning that leverage row-column connectivity to eliminate infeasible partial matches early, and (iii) a tolerance-based optimization mechanism with a tunable parameter that relaxes pruning by a bounded margin to accelerate convergence, enabling scalable yet accurate overlap discovery. Experiments on real-world datasets show that HyperSplit discovers overlaps more effectively (larger overlaps in up to 78.8% of the cases) and more efficiently than state of the art. Three case studies further demonstrate its practical impact across three tasks: cross-source copy detection, data deduplication, and version comparison.

翻译：暂无翻译

0

相关内容

【NeurIPS2024】TableRAG：基于语言模型的百万标记表格理解

【NeurIPS2024】TableRAG：基于语言模型的百万标记表格理解

专知会员服务

38+阅读 · 2024年10月8日

最新《图神经网络知识图谱补全综述论文》A Survey on Graph Neural Networks for Knowledge Graph Completion

最新《图神经网络知识图谱补全综述论文》A Survey on Graph Neural Networks for Knowledge Graph Completion

专知会员服务

137+阅读 · 2020年7月29日

【NeurIPS 2019】多关系庞加莱图嵌入，Multi-relational Poincaré Graph Embeddings

【NeurIPS 2019】多关系庞加莱图嵌入，Multi-relational Poincaré Graph Embeddings

专知会员服务

49+阅读 · 2020年6月15日

【KDD2020】多层次图卷积网络的跨平台锚链预测，Multi-level Graph Convolutional Networks for Cross-platform Anchor Link Prediction

【KDD2020】多层次图卷积网络的跨平台锚链预测，Multi-level Graph Convolutional Networks for Cross-platform Anchor Link Prediction

专知会员服务

34+阅读 · 2020年6月7日

【论文推荐】层次知识图谱，Hierarchical Knowledge Graphs: A Novel Information Representation for Exploratory Search Tasks

【论文推荐】层次知识图谱，Hierarchical Knowledge Graphs: A Novel Information Representation for Exploratory Search Tasks

专知会员服务

49+阅读 · 2020年5月26日

【ACL2020-浙大-微软】多轮对话推理数据集，MuTual: A Dataset for Multi-Turn Dialogue Reasoning

【ACL2020-浙大-微软】多轮对话推理数据集，MuTual: A Dataset for Multi-Turn Dialogue Reasoning

专知会员服务

38+阅读 · 2020年4月10日

【大规模数据系统，552页ppt】Large-scale Data Systems

【大规模数据系统，552页ppt】Large-scale Data Systems

专知会员服务

61+阅读 · 2019年12月21日

【O'Reilly AI Conference 2019】部署大规模分布式数据（How to deploy large-scale distributed data analytics and machine learning on containers (sponsored by HPE))，HPE BlueData，Thomas Phelan

【O'Reilly AI Conference 2019】部署大规模分布式数据（How to deploy large-scale distributed data analytics and machine learning on containers (sponsored by HPE))，HPE BlueData，Thomas Phelan

专知会员服务

19+阅读 · 2019年11月5日

Natural Language Interface to Knowledge Graph (our experience) ，加州大学圣塔芭芭拉分校严锡峰副教授，CIPS ATT 16（2019）

Natural Language Interface to Knowledge Graph (our experience) ，加州大学圣塔芭芭拉分校严锡峰副教授，CIPS ATT 16（2019）

专知会员服务

16+阅读 · 2019年10月25日

【VLDB2019 tutorial】TextCube：自动构建和多维探索，TextCube: Automated Construction and Multidimensional Exploration，韩家炜，Jingbo Shang

【VLDB2019 tutorial】TextCube：自动构建和多维探索，TextCube: Automated Construction and Multidimensional Exploration，韩家炜，Jingbo Shang

专知会员服务

27+阅读 · 2019年8月29日

55页图深度学习导论《A Gentle Introduction to Deep Learning for Graphs》

55页图深度学习导论《A Gentle Introduction to Deep Learning for Graphs》

专知

16+阅读 · 2020年1月3日

【泡泡图灵智库】ContextDesc：用跨模态上下文增强的局部描述子

【泡泡图灵智库】ContextDesc：用跨模态上下文增强的局部描述子

泡泡机器人SLAM

34+阅读 · 2019年9月18日

Graph Neural Networks 综述

Graph Neural Networks 综述

计算机视觉life

30+阅读 · 2019年8月13日

【综述】3D数据分类深度学习方法综述，25页论文带你全面了解最新进展

【综述】3D数据分类深度学习方法综述，25页论文带你全面了解最新进展

中国人工智能学会

20+阅读 · 2019年7月17日

【泡泡点云时空】DeepMapping: 来自多重点云的无监督地图估计

【泡泡点云时空】DeepMapping: 来自多重点云的无监督地图估计

泡泡机器人SLAM

29+阅读 · 2019年5月29日

disentangled-representation-papers

disentangled-representation-papers

CreateAMind

26+阅读 · 2018年9月12日

图像检索研究进展：浅层、深层特征及特征融合

图像检索研究进展：浅层、深层特征及特征融合

机器学习研究会

65+阅读 · 2018年3月26日

IBM新论文|SamplePairing：针对图像处理领域的高效数据增强方式

IBM新论文|SamplePairing：针对图像处理领域的高效数据增强方式

极市平台

16+阅读 · 2018年1月20日

论文报告 | Graph-based Neural Multi-Document Summarization

论文报告 | Graph-based Neural Multi-Document Summarization

科技创新与创业

15+阅读 · 2017年12月15日

推荐中的序列化建模：Session-based neural recommendation

推荐中的序列化建模：Session-based neural recommendation

机器学习研究会

18+阅读 · 2017年11月5日

大规模多视角高维图像特征提取

国家自然科学基金

3+阅读 · 2017年12月31日

基于深度学习的海量截获卫星数据分析技术研究

国家自然科学基金

1+阅读 · 2015年12月31日

面向甲骨学知识图谱的实体发现及语义关系挖掘研究

国家自然科学基金

3+阅读 · 2015年12月31日

基于复杂数据的回归模型统计推断及其应用

国家自然科学基金

3+阅读 · 2015年12月31日

基于网络的复杂疾病动态表观修饰模块挖掘

国家自然科学基金

0+阅读 · 2015年12月31日

基于稳健估计方程的复杂纵向数据研究

国家自然科学基金

0+阅读 · 2015年12月31日

复杂纵向数据的分位回归建模及其在生物医学大数据中的应用

国家自然科学基金

4+阅读 · 2015年12月31日

复杂数据下带有形状约束的半参数模型统计推断

国家自然科学基金

3+阅读 · 2014年12月31日

大规模轨迹数据的地理空间关联解译及分析挖掘研究

国家自然科学基金

1+阅读 · 2014年12月31日

基于概率图的文本检索模型及算法研究

国家自然科学基金

2+阅读 · 2014年12月31日

Estimating Absolute Web Crawl Coverage From Longitudinal Set Intersections

Estimating Absolute Web Crawl Coverage From Longitudinal Set Intersections

Arxiv

0+阅读 · 3月16日

TableMark: A Multi-bit Watermark for Synthetic Tabular Data

Arxiv

0+阅读 · 3月14日

Loglinear modelling of huge contingency tables

Arxiv

0+阅读 · 3月7日

Expert-Aided Causal Discovery of Ancestral Graphs

Arxiv

0+阅读 · 3月6日

Exploring Singularities in point clouds with the graph Laplacian: An explicit approach

Arxiv

0+阅读 · 2月23日

A Survey of Deep Graph Clustering: Taxonomy, Challenge, and Application

Arxiv

13+阅读 · 2022年11月23日

Geometrically Equivariant Graph Neural Networks: A Survey

Arxiv

22+阅读 · 2022年2月16日

Reinforced Negative Sampling over Knowledge Graph for Recommendation

Arxiv

17+阅读 · 2020年3月12日

Graph Neural Networks: A Review of Methods and Applications

Graph Neural Networks: A Review of Methods and Applications

Arxiv

76+阅读 · 2018年12月20日

SpectralNet: Spectral Clustering using Deep Neural Networks

Arxiv

11+阅读 · 2018年1月10日

VIP会员

文章信息

相关主题

相关VIP内容

【NeurIPS2024】TableRAG：基于语言模型的百万标记表格理解

【NeurIPS2024】TableRAG：基于语言模型的百万标记表格理解

专知会员服务

38+阅读 · 2024年10月8日

最新《图神经网络知识图谱补全综述论文》A Survey on Graph Neural Networks for Knowledge Graph Completion

最新《图神经网络知识图谱补全综述论文》A Survey on Graph Neural Networks for Knowledge Graph Completion

专知会员服务

137+阅读 · 2020年7月29日

【NeurIPS 2019】多关系庞加莱图嵌入，Multi-relational Poincaré Graph Embeddings

【NeurIPS 2019】多关系庞加莱图嵌入，Multi-relational Poincaré Graph Embeddings

专知会员服务

49+阅读 · 2020年6月15日

【KDD2020】多层次图卷积网络的跨平台锚链预测，Multi-level Graph Convolutional Networks for Cross-platform Anchor Link Prediction

【KDD2020】多层次图卷积网络的跨平台锚链预测，Multi-level Graph Convolutional Networks for Cross-platform Anchor Link Prediction

专知会员服务

34+阅读 · 2020年6月7日

【论文推荐】层次知识图谱，Hierarchical Knowledge Graphs: A Novel Information Representation for Exploratory Search Tasks

【论文推荐】层次知识图谱，Hierarchical Knowledge Graphs: A Novel Information Representation for Exploratory Search Tasks

专知会员服务

49+阅读 · 2020年5月26日

【ACL2020-浙大-微软】多轮对话推理数据集，MuTual: A Dataset for Multi-Turn Dialogue Reasoning

【ACL2020-浙大-微软】多轮对话推理数据集，MuTual: A Dataset for Multi-Turn Dialogue Reasoning

专知会员服务

38+阅读 · 2020年4月10日

【大规模数据系统，552页ppt】Large-scale Data Systems

【大规模数据系统，552页ppt】Large-scale Data Systems

专知会员服务

61+阅读 · 2019年12月21日

【O'Reilly AI Conference 2019】部署大规模分布式数据（How to deploy large-scale distributed data analytics and machine learning on containers (sponsored by HPE))，HPE BlueData，Thomas Phelan

【O'Reilly AI Conference 2019】部署大规模分布式数据（How to deploy large-scale distributed data analytics and machine learning on containers (sponsored by HPE))，HPE BlueData，Thomas Phelan

专知会员服务

19+阅读 · 2019年11月5日

Natural Language Interface to Knowledge Graph (our experience) ，加州大学圣塔芭芭拉分校严锡峰副教授，CIPS ATT 16（2019）

Natural Language Interface to Knowledge Graph (our experience) ，加州大学圣塔芭芭拉分校严锡峰副教授，CIPS ATT 16（2019）

专知会员服务

16+阅读 · 2019年10月25日

【VLDB2019 tutorial】TextCube：自动构建和多维探索，TextCube: Automated Construction and Multidimensional Exploration，韩家炜，Jingbo Shang

【VLDB2019 tutorial】TextCube：自动构建和多维探索，TextCube: Automated Construction and Multidimensional Exploration，韩家炜，Jingbo Shang

专知会员服务

27+阅读 · 2019年8月29日

热门VIP内容

开通专知VIP会员享更多权益服务

乌克兰开放真实战场数据以训练国防人工智能

《基于近端策略优化算法的超视距空战机动》80页报告

美陆军下一代指挥控制（NGC2）原型系统借助Raft数据平台展示快速决策能力

《防空压制行动中机动与开火决策的强化学习方法》76页报告

相关资讯

55页图深度学习导论《A Gentle Introduction to Deep Learning for Graphs》

55页图深度学习导论《A Gentle Introduction to Deep Learning for Graphs》

专知

16+阅读 · 2020年1月3日

【泡泡图灵智库】ContextDesc：用跨模态上下文增强的局部描述子

【泡泡图灵智库】ContextDesc：用跨模态上下文增强的局部描述子

泡泡机器人SLAM

34+阅读 · 2019年9月18日

Graph Neural Networks 综述

Graph Neural Networks 综述

计算机视觉life

30+阅读 · 2019年8月13日

【综述】3D数据分类深度学习方法综述，25页论文带你全面了解最新进展

【综述】3D数据分类深度学习方法综述，25页论文带你全面了解最新进展

中国人工智能学会

20+阅读 · 2019年7月17日

【泡泡点云时空】DeepMapping: 来自多重点云的无监督地图估计

【泡泡点云时空】DeepMapping: 来自多重点云的无监督地图估计

泡泡机器人SLAM

29+阅读 · 2019年5月29日

disentangled-representation-papers

disentangled-representation-papers

CreateAMind

26+阅读 · 2018年9月12日

图像检索研究进展：浅层、深层特征及特征融合

图像检索研究进展：浅层、深层特征及特征融合

机器学习研究会

65+阅读 · 2018年3月26日

IBM新论文|SamplePairing：针对图像处理领域的高效数据增强方式

IBM新论文|SamplePairing：针对图像处理领域的高效数据增强方式

极市平台

16+阅读 · 2018年1月20日

论文报告 | Graph-based Neural Multi-Document Summarization

论文报告 | Graph-based Neural Multi-Document Summarization

科技创新与创业

15+阅读 · 2017年12月15日

推荐中的序列化建模：Session-based neural recommendation

推荐中的序列化建模：Session-based neural recommendation

机器学习研究会

18+阅读 · 2017年11月5日

相关论文

Estimating Absolute Web Crawl Coverage From Longitudinal Set Intersections

Estimating Absolute Web Crawl Coverage From Longitudinal Set Intersections

Arxiv

0+阅读 · 3月16日

TableMark: A Multi-bit Watermark for Synthetic Tabular Data

Arxiv

0+阅读 · 3月14日

Loglinear modelling of huge contingency tables

Arxiv

0+阅读 · 3月7日

Expert-Aided Causal Discovery of Ancestral Graphs

Arxiv

0+阅读 · 3月6日

Exploring Singularities in point clouds with the graph Laplacian: An explicit approach

Arxiv

0+阅读 · 2月23日

A Survey of Deep Graph Clustering: Taxonomy, Challenge, and Application

Arxiv

13+阅读 · 2022年11月23日

Geometrically Equivariant Graph Neural Networks: A Survey

Arxiv

22+阅读 · 2022年2月16日

Reinforced Negative Sampling over Knowledge Graph for Recommendation

Arxiv

17+阅读 · 2020年3月12日

Graph Neural Networks: A Review of Methods and Applications

Graph Neural Networks: A Review of Methods and Applications

Arxiv

76+阅读 · 2018年12月20日

SpectralNet: Spectral Clustering using Deep Neural Networks

Arxiv

11+阅读 · 2018年1月10日

相关基金

大规模多视角高维图像特征提取

国家自然科学基金

3+阅读 · 2017年12月31日

基于深度学习的海量截获卫星数据分析技术研究

国家自然科学基金

1+阅读 · 2015年12月31日

面向甲骨学知识图谱的实体发现及语义关系挖掘研究

国家自然科学基金

3+阅读 · 2015年12月31日

基于复杂数据的回归模型统计推断及其应用

国家自然科学基金

3+阅读 · 2015年12月31日

基于网络的复杂疾病动态表观修饰模块挖掘

国家自然科学基金

0+阅读 · 2015年12月31日

基于稳健估计方程的复杂纵向数据研究

国家自然科学基金

0+阅读 · 2015年12月31日

复杂纵向数据的分位回归建模及其在生物医学大数据中的应用

国家自然科学基金

4+阅读 · 2015年12月31日

复杂数据下带有形状约束的半参数模型统计推断

国家自然科学基金

3+阅读 · 2014年12月31日

大规模轨迹数据的地理空间关联解译及分析挖掘研究

国家自然科学基金

1+阅读 · 2014年12月31日

基于概率图的文本检索模型及算法研究

国家自然科学基金

2+阅读 · 2014年12月31日

微信扫码咨询专知VIP会员