Popular centroid-based clustering methods are typically optimized for global objectives and may fail to adequately represent large groups of datapoints. This calls for proportionality notions suited to metric settings. Ideally, such notions should admit polynomial-time algorithms for (a) finding proportional outcomes, and (b) checking whether a given outcome is proportional; the latter makes it possible to evaluate traditional algorithms that come without proportionality guarantees (e.g., $k$-means). A promising approach imports proportionality notions from multiwinner voting with approval ballots. In particular, mPJR, the metric version of the well-known Proportional Justified Representation (PJR) axiom, satisfies (a), but whether it satisfies (b) was open. In this work, we study the computational complexity of auditing proportional representation in clustering. In the approval setting, PJR is coNP-complete to verify; however, it admits a strengthening, PJR+, which satisfies both (a) and (b). We show that these results carry over to the metric setting: verifying mPJR is coNP-complete; we then define mPJR+, a metric analog of PJR+, and argue that it satisfies both (a) and (b). However, auditing mPJR+ relies on repeated submodular minimization, which renders it impractical at scale, and a natural combinatorial approach is infeasible. As a partial remedy, we propose an mPJR+ verification algorithm whose running time is exponential in $k$ but quasilinear in the number of datapoints. Motivated by these hardness results, we introduce DC-mPJR+, a proportionality concept that offers representation guarantees to a restricted set of coalitions around unselected centers and admits an $O(mn \log n + mnk)$ verification algorithm. DC-mPJR+ outcomes can be computed efficiently, and any $\gamma$-DC-mPJR+ solution satisfies $(\gamma + 2)$-mPJR+.