Identifying the causes of a model's unfairness is an important yet relatively unexplored task. We study this problem through the lens of training data, the major source of unfairness. We ask: how would the unfairness of a model change if its training samples (1) were collected from a different (e.g., demographic) group, (2) were labeled differently, or (3) had their features modified? In other words, we quantify the influence of training samples on unfairness by counterfactually changing samples based on predefined concepts, i.e., data attributes such as features, labels, and sensitive attributes. Our framework can not only help practitioners understand the observed unfairness and mitigate it by repairing their training data, but also supports many other applications, e.g., detecting mislabeling, fixing imbalanced representations, and detecting fairness-targeted poisoning attacks.
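To make the counterfactual question concrete, the following is a minimal sketch and not the framework's actual estimator. It assumes brute-force retraining on a small synthetic dataset, a hand-written logistic regression, and the demographic parity gap as the unfairness measure. All names (`train_logreg`, `dp_gap`, `counterfactual_influence`) and the data-generating process are hypothetical choices made for illustration.

```python
import numpy as np

def train_logreg(X, y, steps=500, lr=0.1):
    """Fit logistic regression by plain gradient descent (deterministic)."""
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])  # append a bias column
    w = np.zeros(Xb.shape[1])
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(Xb @ w)))
        w -= lr * Xb.T @ (p - y) / len(y)
    return w

def predict(w, X):
    """Hard 0/1 predictions from the fitted weights."""
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])
    return (Xb @ w > 0).astype(int)

def dp_gap(w, X, a):
    """Demographic parity gap: |P(yhat=1 | a=0) - P(yhat=1 | a=1)|."""
    yhat = predict(w, X)
    return float(abs(yhat[a == 0].mean() - yhat[a == 1].mean()))

def counterfactual_influence(X, y, X_cf, y_cf, X_eval, a_eval):
    """Change in unfairness when (X, y) is replaced by a counterfactual
    training set (X_cf, y_cf). Negative values mean the change reduces
    the gap. Assumption: influence is measured by full retraining."""
    base = dp_gap(train_logreg(X, y), X_eval, a_eval)
    cf = dp_gap(train_logreg(X_cf, y_cf), X_eval, a_eval)
    return cf - base

# Synthetic data (assumed): one feature correlated with a binary group a,
# with the sensitive attribute included as a second input column.
rng = np.random.default_rng(0)
n = 400
a = rng.integers(0, 2, size=n)
x = rng.normal(loc=a, scale=1.0, size=n)
X = np.column_stack([x, a]).astype(float)
y = (x + rng.normal(scale=0.5, size=n) > 0.5).astype(float)

# Counterfactual (2): relabel up to 20 negative samples from group 0.
y_cf = y.copy()
flip = np.where((a == 0) & (y == 0))[0][:20]
y_cf[flip] = 1.0

influence = counterfactual_influence(X, y, X, y_cf, X, a)
print(f"change in demographic parity gap: {influence:+.4f}")
```

The same function covers the other two questions by passing a modified `X_cf`: overwriting the group column corresponds to collecting a sample from a different group, and overwriting the feature column corresponds to modifying its features. Retraining once per counterfactual is only practical for small models; it is used here purely to illustrate the quantity being measured.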