We propose a novel subset selection task called min-distance diverse data summarization ($\textsf{MDDS}$), which has a wide variety of applications in machine learning, e.g., data sampling and feature selection. Given a set of points in a metric space, the goal is to select a subset $S$ of at most $k$ points that maximizes an objective combining the total utility of the selected points with a diversity term capturing the minimum distance between any pair of selected points. For example, the points may correspond to training examples in a data sampling problem, e.g., learned embeddings of images extracted from a deep neural network. This work presents the $\texttt{GIST}$ algorithm, which achieves a $\frac{2}{3}$-approximation guarantee for $\textsf{MDDS}$ by approximating a series of maximum independent set problems with a bicriteria greedy algorithm. We also prove a complementary $(\frac{2}{3}+\varepsilon)$-hardness of approximation, for any $\varepsilon > 0$. Finally, we provide an empirical study demonstrating that $\texttt{GIST}$ outperforms existing methods for $\textsf{MDDS}$ on synthetic data, as well as in a real-world image classification experiment that studies single-shot subset selection on ImageNet.
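For concreteness, one natural instantiation of this objective is sketched below; the trade-off parameter $\lambda$ and the additive combination of the two terms are assumptions for illustration, not taken verbatim from the abstract:
\[
f(S) \;=\; \sum_{v \in S} u(v) \;+\; \lambda \cdot \min_{\substack{u, v \in S \\ u \neq v}} d(u, v), \qquad \text{subject to } |S| \le k,
\]
where $u(\cdot)$ denotes the utility of a point, $d(\cdot,\cdot)$ the metric distance, and $\lambda \ge 0$ controls how strongly the minimum pairwise distance (diversity) is weighted against total utility.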