Auto-Relate: A Unified Approach to Discovering Reliable Functional Relationships Leveraging Statistical Tests

Tables in spreadsheets, computational notebooks, and databases often contain rich inter-column relationships. Yet these relationships are typically implicit and are often lost when tables are exported to standard formats. Recovering them can benefit downstream tasks such as table understanding, data quality improvement, and provenance analysis. However, simply mining the relationships that hold on an observed table is insufficient, because many of them are spurious, arising from coincidence, redundancy, or limited data diversity. In this paper, we introduce functional relationships (FRs) as a unified notion of inter-column relationships in tables, subsuming arithmetic relationships, string transformations, and functional dependencies. We characterize FR reliability through four complementary criteria: accuracy, atomicity, stability, and integrity. Guided by these criteria, we propose Auto-Relate, a mine-then-verify framework that first generates accurate candidate FRs and then verifies the three remaining criteria (atomicity, stability, and integrity) through a Minimality Test, a Perturbation Test, and an Independence Test, respectively. To further improve efficiency, we develop three optimization strategies: a group-by lower bound for early rejection, a closed-form speedup for arithmetic FRs, and a binomial bound for statistically guided early termination. We construct a large-scale benchmark suite from 58,679 real-world spreadsheets and relational tables; it contains 6,414 ground-truth FRs spanning all three FR types. Extensive experiments against 18 baselines show that Auto-Relate consistently achieves the best performance across all settings, with an average PR-AUC of 0.87, 59% higher than the best competing baseline.
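To make the three FR types and the accuracy criterion concrete, the following is a minimal Python sketch on a toy table. It is not the Auto-Relate implementation: the table, the column names, and the helpers `fr_accuracy` and `fd_accuracy` are hypothetical and introduced only for illustration, and accuracy is assumed here to mean the fraction of rows consistent with a candidate FR.

```python
from collections import defaultdict

# Hypothetical toy table; each row maps column name -> value.
rows = [
    {"qty": 2, "price": 5.0, "total": 10.0, "first": "Ada", "last": "Lovelace",
     "full": "Ada Lovelace", "zip": "10001", "city": "New York"},
    {"qty": 3, "price": 1.5, "total": 4.5, "first": "Alan", "last": "Turing",
     "full": "Alan Turing", "zip": "10001", "city": "New York"},
    {"qty": 1, "price": 7.0, "total": 7.0, "first": "Grace", "last": "Hopper",
     "full": "Grace Hopper", "zip": "94105", "city": "San Francisco"},
    {"qty": 4, "price": 2.5, "total": 10.0, "first": "Edsger", "last": "Dijkstra",
     "full": "Edsger Dijkstra", "zip": "94105", "city": "Oakland"},
]

def fr_accuracy(rows, inputs, output, fn):
    """Fraction of rows where fn applied to the input columns equals the output column."""
    hits = sum(1 for r in rows if fn(*(r[c] for c in inputs)) == r[output])
    return hits / len(rows)

def fd_accuracy(rows, lhs, rhs):
    """Fraction of rows that agree with the majority rhs value of their lhs group."""
    groups = defaultdict(lambda: defaultdict(int))
    for r in rows:
        groups[tuple(r[c] for c in lhs)][r[rhs]] += 1
    kept = sum(max(counts.values()) for counts in groups.values())
    return kept / len(rows)

# Arithmetic FR: total = qty * price.
arith = fr_accuracy(rows, ["qty", "price"], "total", lambda q, p: q * p)
# String transformation FR: full = first + " " + last.
string = fr_accuracy(rows, ["first", "last"], "full", lambda a, b: f"{a} {b}")
# Functional dependency: zip -> city (violated by one row in this toy table).
fd = fd_accuracy(rows, ["zip"], "city")
# Every qty value is unique, so qty -> city holds trivially on the observed rows.
spurious = fd_accuracy(rows, ["qty"], "city")

print(arith, string, fd, spurious)
```

The last candidate, `qty -> city`, holds perfectly on these four rows only because every `qty` value happens to be unique. It is an instance of the kind of spurious relationship that accuracy on the observed table alone cannot rule out, which is the motivation the abstract gives for the additional verification tests.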
