The goal of fair clustering is to find clusters such that the proportions of sensitive attributes (e.g., gender, race) in each cluster are similar to those in the entire dataset. Various fair clustering algorithms have been proposed that modify standard K-means clustering to satisfy a given fairness constraint. A critical limitation of several existing fair clustering algorithms is that the number of parameters to be learned is proportional to the sample size, because the cluster assignment of each datum must be optimized jointly with the cluster centers; this makes the algorithms difficult to scale. In this paper, we propose a new fair clustering algorithm based on a finite mixture model, called Fair Model-based Clustering (FMC). A main advantage of FMC is that its number of learnable parameters is independent of the sample size, so it scales easily. In particular, mini-batch learning can be used to obtain clusters that are approximately fair. Moreover, FMC can be applied to non-metric data (e.g., categorical data) as long as the likelihood is well-defined. We provide theoretical and empirical justification for the superiority of the proposed algorithm.
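To make the two central ideas concrete, the following minimal Python sketch may help. It is not the FMC algorithm. The function `fairness_gap` is a hypothetical measure introduced here only for illustration: it reports the largest absolute difference between a group's share within any cluster and its share in the whole dataset. The function `parameter_counts` assumes a diagonal-covariance Gaussian mixture, which the abstract does not specify, and compares the number of quantities to be learned with and without per-datum assignments.

```python
from collections import Counter


def group_proportions(groups):
    """Return the share of each sensitive-attribute value in `groups`."""
    counts = Counter(groups)
    total = len(groups)
    return {g: c / total for g, c in counts.items()}


def fairness_gap(assignments, groups):
    """Largest absolute gap between a cluster's group share and the dataset's.

    Hypothetical metric for illustration only; 0.0 means every cluster
    mirrors the dataset-level proportions exactly.
    """
    overall = group_proportions(groups)
    clusters = {}
    for k, g in zip(assignments, groups):
        clusters.setdefault(k, []).append(g)
    gap = 0.0
    for members in clusters.values():
        local = group_proportions(members)
        for g, p in overall.items():
            gap = max(gap, abs(local.get(g, 0.0) - p))
    return gap


def parameter_counts(n, k, d):
    """Count learnable quantities for n data points, k clusters, d dimensions.

    Assumption: the mixture is Gaussian with diagonal covariances.
    """
    hard_assignment = n + k * d      # one label per datum plus k centers
    mixture = (k - 1) + 2 * k * d    # weights, means, diagonal variances
    return hard_assignment, mixture


groups = ["a", "a", "b", "b", "a", "a", "b", "b"]
fair = [0, 1, 0, 1, 0, 1, 0, 1]      # each cluster is half "a", half "b"
unfair = [0, 0, 1, 1, 0, 0, 1, 1]    # each cluster holds a single group

print(fairness_gap(fair, groups))
print(fairness_gap(unfair, groups))
print(parameter_counts(1000, 3, 2))
```

Under these assumptions, the mixture's count depends only on the number of clusters and the dimension, whereas the assignment-based count grows linearly with the sample size.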