Affinity Clustering Framework for Data Debiasing Using Pairwise Distribution Discrepancy

Group imbalance, which results from inadequate or unrepresentative data collection, is a primary cause of representation bias in datasets. Representation bias can exist with respect to different groups of one or more protected attributes and, when a learning model is trained on such biased data, may lead to prejudicial and discriminatory outcomes toward certain groups of individuals. This paper presents MASC, a data augmentation approach that uses affinity clustering to balance the representation of non-protected and protected groups in a target dataset. It does so by drawing instances of the same protected attributes from similar datasets, namely those assigned to the same cluster as the target dataset because they share instances of the protected attribute. The method constructs an affinity matrix by quantifying the distribution discrepancy between each pair of datasets and transforming these discrepancies into a symmetric pairwise similarity matrix. Non-parametric spectral clustering is then applied to this affinity matrix, automatically grouping the datasets into an optimal number of clusters. We walk through a step-by-step experiment that demonstrates the procedure, and we evaluate and discuss its performance. We also compare the method with other data augmentation methods, before and after augmentation, and include a model evaluation analysis for each. Because our method can handle non-binary protected attributes, our experiments measure bias in a non-binary setting: the distribution of racial groups, for two separate minority groups relative to the majority group, before and after debiasing. Empirical results suggest that mitigating dataset bias by augmenting with real (genuine) data from similar contexts can debias the target datasets with effectiveness comparable to existing data augmentation strategies.
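
The clustering pipeline summarized in the abstract (pairwise distribution discrepancy, symmetric affinity matrix, spectral clustering with an automatically chosen number of clusters) can be illustrated with a minimal sketch. It is not the authors' implementation, and several choices are assumptions made only for illustration: the discrepancy is a biased RBF-kernel MMD estimate (`rbf_mmd2`), the affinity is an exponential of the discrepancy with a bandwidth set to a quarter of the median discrepancy, and the number of clusters is picked with the eigengap heuristic on the normalized Laplacian. All function names and the toy data are hypothetical.

```python
import numpy as np

def rbf_mmd2(x, y, gamma=1.0):
    """Biased estimate of squared MMD between two samples (assumed discrepancy)."""
    def kernel(a, b):
        d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
        return np.exp(-gamma * d2)
    return kernel(x, x).mean() + kernel(y, y).mean() - 2.0 * kernel(x, y).mean()

def affinity_matrix(datasets, gamma=1.0, scale=None):
    """Turn pairwise discrepancies into a symmetric similarity matrix."""
    n = len(datasets)
    disc = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            disc[i, j] = disc[j, i] = max(rbf_mmd2(datasets[i], datasets[j], gamma), 0.0)
    if scale is None:
        # Assumed bandwidth heuristic: a quarter of the median off-diagonal discrepancy.
        scale = 0.25 * np.median(disc[np.triu_indices(n, k=1)]) + 1e-12
    return np.exp(-disc / scale)

def kmeans(points, k, iters=50):
    """Plain Lloyd iterations with deterministic farthest-point initialization."""
    centers = [points[0]]
    for _ in range(1, k):
        dist = np.min([((points - c) ** 2).sum(1) for c in centers], axis=0)
        centers.append(points[np.argmax(dist)])
    centers = np.array(centers, dtype=float)
    for _ in range(iters):
        labels = np.argmin(((points[:, None] - centers[None]) ** 2).sum(-1), axis=1)
        for c in range(k):
            if np.any(labels == c):
                centers[c] = points[labels == c].mean(axis=0)
    return labels

def cluster_datasets(affinity):
    """Spectral clustering; the cluster count comes from the largest eigengap."""
    n = affinity.shape[0]
    inv_sqrt_deg = 1.0 / np.sqrt(affinity.sum(axis=1))
    laplacian = np.eye(n) - inv_sqrt_deg[:, None] * affinity * inv_sqrt_deg[None, :]
    vals, vecs = np.linalg.eigh(laplacian)       # eigenvalues in ascending order
    k = int(np.argmax(np.diff(vals))) + 1        # eigengap heuristic (assumption)
    emb = vecs[:, :k]
    emb = emb / np.maximum(np.linalg.norm(emb, axis=1, keepdims=True), 1e-12)
    return kmeans(emb, k), k

# Toy data: six synthetic "datasets" drawn from two clearly different distributions.
rng = np.random.default_rng(0)
datasets = [rng.normal(0.0, 1.0, (200, 2)) for _ in range(3)]
datasets += [rng.normal(5.0, 1.0, (200, 2)) for _ in range(3)]

affinity = affinity_matrix(datasets)
labels, k = cluster_datasets(affinity)
```

In a setting like the one described, datasets that receive the same label as the target dataset would be the candidates from which instances of under-represented protected groups are drawn. The bandwidth and the eigengap rule both affect how many clusters are found, so they would need tuning on real data.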
