We present an unsupervised method for aggregating anomalies in tabular datasets by identifying the top-k tabular data quality insights. Each insight consists of a set of anomalous attributes and the corresponding subsets of records that serve as evidence for the user. Identifying these insight blocks is challenging due to (i) the absence of labeled anomalies, (ii) the exponential size of the subset search space, and (iii) the complex dependencies among attributes, which obscure the true sources of anomalies. Simple frequency-based methods fail to capture these dependencies, leading to inaccurate results. To address these challenges, we introduce Tab-Shapley, a framework based on cooperative game theory that uses Shapley values to quantify the contribution of each attribute to the data's anomalous nature. While calculating Shapley values typically requires exponential time, we show that our game admits a closed-form solution, making the computation efficient. We validate the effectiveness of our approach through empirical analysis on real-world tabular datasets with ground-truth anomaly labels.
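To make the exponential cost concrete, the following minimal sketch computes exact Shapley values by enumerating every coalition of attributes. It is not the Tab-Shapley game or its closed-form solution, neither of which is specified in this abstract. The attribute names, the per-attribute scores, the pairwise bonus, and the functions `shapley_values` and `toy_value` are all hypothetical, chosen only to show how a dependency between two attributes shifts their attributions.

```python
from itertools import combinations
from math import factorial


def shapley_values(players, value):
    """Exact Shapley values by brute force.

    Enumerates all coalitions not containing each player, so the number of
    calls to `value` grows as O(n * 2^(n-1)) in the number of players n.
    """
    n = len(players)
    phi = {}
    for p in players:
        others = [q for q in players if q != p]
        total = 0.0
        for size in range(n):
            # Standard Shapley weight |S|! (n - |S| - 1)! / n!
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for coalition in combinations(others, size):
                s = frozenset(coalition)
                # Marginal contribution of p when joining coalition s
                total += weight * (value(s | {p}) - value(s))
        phi[p] = total
    return phi


# Hypothetical toy game (an assumption for illustration, not from the paper):
# each attribute carries its own anomaly score, and one pair of attributes
# is jointly anomalous, which a per-attribute frequency count would miss.
ANOMALY_SCORES = {"age": 4.0, "zip": 1.0, "income": 2.0}
PAIR_BONUS = {frozenset({"age", "income"}): 3.0}


def toy_value(coalition):
    """Characteristic function: individual scores plus joint-anomaly bonuses."""
    base = sum(ANOMALY_SCORES[a] for a in coalition)
    bonus = sum(b for pair, b in PAIR_BONUS.items() if pair <= coalition)
    return base + bonus


phi = shapley_values(list(ANOMALY_SCORES), toy_value)
for attribute, score in sorted(phi.items(), key=lambda kv: -kv[1]):
    print(f"{attribute}: {score:.2f}")
```

In this toy game the joint bonus is split evenly between the two interacting attributes, and the attributions sum to the value of the full attribute set (the efficiency property of Shapley values). The nested enumeration is what becomes infeasible as the number of attributes grows, which is the cost a closed-form solution avoids.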