Clustering is an unsupervised learning task that aims to partition data into a set of clusters. In many applications, these clusters correspond to real-world constructs (e.g., electoral districts) whose benefits a group can attain only once it reaches a minimum level of representation (e.g., 50\% to elect its desired candidate). This paper considers the problem of performing $k$-means clustering while ensuring that groups (e.g., demographic groups) reach that minimum level of representation in a specified number of clusters. We show that Lloyd's algorithm, the popular algorithm for $k$-means clustering, can produce unfair outcomes in which certain groups fail to reach the minimum representation threshold in a proportional number of clusters. We formulate the problem through a mixed-integer optimization framework and present a variant of Lloyd's algorithm, called MiniReL, that directly incorporates the fairness constraints. We show that incorporating the fairness criteria leads to an NP-hard sub-problem within Lloyd's algorithm, but we provide computational approaches that make the problem tractable even for large datasets. Numerical results on standard benchmark datasets show that the approach creates fairer clusters with practically no increase in the $k$-means clustering cost.
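To make the representation criterion concrete, the following minimal Python sketch counts, for each group, the clusters in which that group's share of points reaches a threshold. It is an illustration only and not the MiniReL algorithm: the function name `represented_cluster_counts`, the parameter `alpha`, and the toy data are all hypothetical, and the 0.5 threshold simply mirrors the 50\% example above.

```python
from collections import Counter


def represented_cluster_counts(assignments, groups, alpha=0.5):
    """Count, per group, the clusters where the group's share is >= alpha.

    assignments[i] is the cluster of point i; groups[i] is its group label.
    All names here are illustrative assumptions, not taken from the paper.
    """
    cluster_sizes = Counter(assignments)
    pair_counts = Counter(zip(assignments, groups))
    counts = {g: 0 for g in set(groups)}
    for (cluster, group), n in pair_counts.items():
        # The group is "represented" in this cluster if it holds at least
        # an alpha fraction of the cluster's points.
        if n >= alpha * cluster_sizes[cluster]:
            counts[group] += 1
    return counts


# Toy data: 16 points, 4 clusters; group "b" is 25% of the points.
groups = ["b", "a", "a", "a"] * 4

# Assignment 1: group "b" is spread evenly, one point per cluster.
spread = [0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3]

# Assignment 2: all "b" points share cluster 0.
concentrated = [0, 1, 1, 1, 0, 1, 2, 2, 0, 2, 2, 3, 0, 3, 3, 3]

spread_counts = represented_cluster_counts(spread, groups)
concentrated_counts = represented_cluster_counts(concentrated, groups)
print(spread_counts, concentrated_counts)
```

In the first assignment, group "b" holds a quarter of the data yet reaches the threshold in no cluster. In the second, it reaches the threshold in one of the four clusters, which is proportional to its share of the data.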