Fair Correlation Clustering in Forests

The study of algorithmic fairness has received growing attention recently. This stems from the awareness that bias in the input data for machine learning systems may result in discriminatory outputs. For clustering tasks, one of the most central notions of fairness is the formalization by Chierichetti, Kumar, Lattanzi, and Vassilvitskii [NeurIPS 2017]. A clustering is said to be fair if each cluster has the same distribution of manifestations of a sensitive attribute as the whole input set. This is motivated by various applications where the objects to be clustered have sensitive attributes that should not be over- or underrepresented. We discuss the applicability of this fairness notion to Correlation Clustering. The existing literature on the resulting Fair Correlation Clustering problem either presents approximation algorithms with poor approximation guarantees or severely limits the possible distributions of the sensitive attribute (often only two manifestations with a 1:1 ratio are considered). Our goal is to understand whether there is hope for better results in between these two extremes. To this end, we consider restricted graph classes, which allow us to characterize the distributions of sensitive attributes for which this form of fairness is tractable from a complexity point of view. While existing work on Fair Correlation Clustering gives approximation algorithms, we focus on exact solutions and investigate whether there are efficiently solvable instances. The unfair version of Correlation Clustering is trivial on forests, but adding fairness creates a surprisingly rich picture of complexities. We give an overview of the distributions and types of forests where Fair Correlation Clustering turns from tractable to intractable. The most surprising insight to us is that the hardness of Fair Correlation Clustering is not caused by the strictness of the fairness condition.
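
To make the fairness notion above concrete, the following minimal Python sketch checks whether a clustering is fair in the stated sense, that is, whether every cluster has the same proportions of attribute values as the whole input. It is an illustration only, not an algorithm from this work. The names is_fair, correlation_cost, color, and the example path are hypothetical. The cost function assumes the common unweighted formulation of Correlation Clustering (edges cut between clusters plus non-adjacent pairs inside a cluster), which the abstract itself does not spell out.

```python
from collections import Counter
from fractions import Fraction
from itertools import combinations


def is_fair(clusters, color):
    """Return True if each (non-empty) cluster has the same color
    proportions as the whole vertex set."""
    total = Counter(color[v] for cluster in clusters for v in cluster)
    n = sum(total.values())
    for cluster in clusters:
        local = Counter(color[v] for v in cluster)
        size = len(cluster)
        # Exact rational comparison avoids floating-point rounding issues.
        if any(Fraction(local[c], size) != Fraction(total[c], n) for c in total):
            return False
    return True


def correlation_cost(clusters, edges):
    """Assumed cost: number of edges between different clusters plus
    number of non-adjacent vertex pairs inside the same cluster."""
    cluster_of = {v: i for i, cluster in enumerate(clusters) for v in cluster}
    edge_set = {frozenset(e) for e in edges}
    cut = sum(1 for e in edge_set if len({cluster_of[v] for v in e}) == 2)
    missing = sum(
        1
        for cluster in clusters
        for pair in combinations(cluster, 2)
        if frozenset(pair) not in edge_set
    )
    return cut + missing


# Hypothetical instance: a path a-b-c-d with alternating colors (1:1 ratio).
color = {"a": "red", "b": "blue", "c": "red", "d": "blue"}
edges = [("a", "b"), ("b", "c"), ("c", "d")]

fair_clustering = [["a", "b"], ["c", "d"]]    # each cluster is 1 red : 1 blue
unfair_clustering = [["a", "c"], ["b", "d"]]  # each cluster is monochromatic

print(is_fair(fair_clustering, color), correlation_cost(fair_clustering, edges))
print(is_fair(unfair_clustering, color), correlation_cost(unfair_clustering, edges))
```

On this toy path, the clustering {a, b}, {c, d} is fair and cuts only the edge between b and c. Putting all four vertices into one cluster is also fair, but pays for the three non-adjacent pairs inside that cluster.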
