Quantifying Spuriousness of Biased Datasets Using Partial Information Decomposition

Spurious patterns are mathematical associations between two or more variables in a dataset that are not causally related. This notion of spuriousness, which usually arises from sampling biases in the dataset, has classically lacked a formal definition. To address this gap, this work presents the first information-theoretic formalization of spuriousness in a dataset (given a split into spurious and core features) using a mathematical framework called Partial Information Decomposition (PID). Specifically, we disentangle the joint information content that the spurious and core features share about a target variable (e.g., the prediction label) into distinct components, namely unique, redundant, and synergistic information. We propose the use of unique information, with roots in Blackwell Sufficiency, as a novel metric to formally quantify dataset spuriousness, and we derive its desirable properties. We empirically demonstrate how higher unique information in the spurious features of a dataset can lead a model to choose the spurious features over the core features for inference, often resulting in low worst-group accuracy. We also propose a novel autoencoder-based estimator for computing unique information that can handle high-dimensional image data. Finally, we show how the unique information in the spurious features is reduced across several dataset-based spurious-pattern-mitigation techniques, such as data reweighting and varying levels of background mixing, demonstrating a novel tradeoff between unique information (spuriousness) and worst-group accuracy.
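
For orientation, the decomposition described above can be written out in the standard bivariate PID form. This is a sketch under assumed notation, not the paper's own: $Y$ denotes the target variable, $B$ the spurious features, and $C$ the core features, and the symbols $\mathrm{Uni}$, $\mathrm{Red}$, and $\mathrm{Syn}$ are labels chosen here for the unique, redundant, and synergistic components.

```latex
\begin{align}
I(Y; B, C) &= \mathrm{Uni}(Y : B \setminus C) + \mathrm{Uni}(Y : C \setminus B)
              + \mathrm{Red}(Y : B, C) + \mathrm{Syn}(Y : B, C), \\
I(Y; B)    &= \mathrm{Uni}(Y : B \setminus C) + \mathrm{Red}(Y : B, C), \\
I(Y; C)    &= \mathrm{Uni}(Y : C \setminus B) + \mathrm{Red}(Y : B, C).
\end{align}
```

Under these identities, fixing any one of the four components determines the other three from the mutual information terms; for example, $\mathrm{Uni}(Y : B \setminus C) = I(Y; B) - \mathrm{Red}(Y : B, C)$. In this assumed notation, the proposed spuriousness measure would correspond to $\mathrm{Uni}(Y : B \setminus C)$, the information about the target that only the spurious features carry.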
