We study the power of uniform sampling for $k$-Median in various metric spaces. We relate the query complexity of approximating $k$-Median to a key parameter of the dataset, the balancedness $\beta \in (0, 1]$ (with $\beta = 1$ meaning perfectly balanced). We show that any algorithm must make $\Omega(1 / \beta)$ queries to the point set to achieve an $O(1)$-approximation for $k$-Median. In particular, this implies that existing constructions of coresets, a popular data reduction technique, cannot be query-efficient. On the other hand, we show that a simple uniform sample of $\mathrm{poly}(k \epsilon^{-1} \beta^{-1})$ points suffices for a $(1 + \epsilon)$-approximation for $k$-Median in various metric spaces, which nearly matches the lower bound. We conduct experiments to verify that in many real datasets the balancedness parameter is usually well bounded, and that uniform sampling performs consistently well even for the case with moderately large balancedness, which supports uniform sampling as a viable approach for solving $k$-Median.
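The sample-then-solve idea described in the abstract can be illustrated with a minimal sketch. It assumes Euclidean points stored as tuples, and it uses a single-swap local search as a stand-in solver on the sample; this is not necessarily the algorithm analyzed in the paper. The function names and the `sample_size` parameter are hypothetical, and since the abstract only states a $\mathrm{poly}(k \epsilon^{-1} \beta^{-1})$ bound without exponents, the sample size is left to the caller.

```python
import random


def euclidean(p, q):
    """Euclidean distance between two equal-length tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5


def kmedian_cost(points, centers, dist=euclidean):
    """k-Median objective: sum of distances to the nearest center."""
    return sum(min(dist(p, c) for c in centers) for p in points)


def local_search_kmedian(points, k, dist, rng, max_rounds=50):
    """Stand-in solver (an assumption, not the paper's method):
    single-swap local search with centers restricted to `points`."""
    centers = rng.sample(points, k)
    best = kmedian_cost(points, centers, dist)
    for _ in range(max_rounds):
        improved = False
        for i in range(k):
            for cand in points:
                if cand in centers:
                    continue
                # Try replacing the i-th center with the candidate.
                trial = centers[:i] + [cand] + centers[i + 1:]
                cost = kmedian_cost(points, trial, dist)
                if cost < best:
                    centers, best, improved = trial, cost, True
        if not improved:
            break  # no single swap lowers the cost on the sample
    return centers


def kmedian_via_uniform_sample(points, k, sample_size, dist=euclidean, seed=0):
    """Draw a uniform sample without replacement, then solve k-Median
    on the sample only. Requires k <= min(sample_size, len(points))."""
    rng = random.Random(seed)
    m = min(sample_size, len(points))
    sample = rng.sample(points, m)
    return local_search_kmedian(sample, k, dist, rng)
```

A typical use is to call `kmedian_via_uniform_sample(points, k, sample_size)` and then evaluate the returned centers on the full dataset with `kmedian_cost(points, centers)`. The solver only ever reads the sampled points, which is what keeps the number of queries to the point set small.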