Given a set of points in $d$-dimensional space, an explainable clustering is one where the clusters are specified by a tree of axis-aligned threshold cuts. Dasgupta et al. (ICML 2020) posed the question of the price of explainability: the worst-case ratio of the cost of the best explainable clustering to that of the best clustering. We show that the price of explainability for $k$-medians is at most $1+H_{k-1}$; in fact, we show that the popular Random Thresholds algorithm has exactly this price of explainability, matching the known lower-bound constructions. We complement our tight analysis of this particular algorithm by constructing instances where the price of explainability (using any algorithm) is at least $(1-o(1)) \ln k$, showing that our result is best possible up to lower-order terms. We also improve the price of explainability for the $k$-means problem to $O(k \ln \ln k)$ from the previous $O(k \ln k)$, considerably closing the gap to the lower bound of $\Omega(k)$. Finally, we study the algorithmic question of finding the best explainable clustering: we show that explainable $k$-medians and $k$-means cannot be approximated to within a factor of $o(\ln k)$, under standard complexity-theoretic conjectures. This essentially settles the approximability of explainable $k$-medians and leaves open the intriguing possibility of obtaining significantly better approximation algorithms for $k$-means than its price of explainability.
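As a small numerical illustration of why the $k$-medians upper bound $1+H_{k-1}$ and the lower bound $(1-o(1)) \ln k$ agree up to lower-order terms, the sketch below evaluates the upper bound for a few values of $k$ and compares it with $\ln k$. It assumes $H_n$ denotes the $n$-th harmonic number $\sum_{i=1}^{n} 1/i$; the function names are hypothetical and chosen only for this example.

```python
import math


def harmonic(n: int) -> float:
    """Return the n-th harmonic number H_n = 1 + 1/2 + ... + 1/n (H_0 = 0)."""
    return sum(1.0 / i for i in range(1, n + 1))


def k_medians_upper_bound(k: int) -> float:
    """Evaluate the upper bound 1 + H_{k-1} on the price of explainability.

    The name is illustrative; the formula is the one stated in the abstract.
    """
    if k < 1:
        raise ValueError("k must be a positive integer")
    return 1.0 + harmonic(k - 1)


if __name__ == "__main__":
    # Compare the upper bound with ln k, the leading term of the lower bound.
    for k in (2, 10, 100, 1000, 10**6):
        upper = k_medians_upper_bound(k)
        print(f"k={k:>8}  1+H_(k-1)={upper:8.4f}  ln k={math.log(k):8.4f}")
```

Since $H_{k-1} = \ln k + O(1)$, the two columns differ only by an additive constant, so their ratio tends to 1 as $k$ grows.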