We study core-set construction algorithms for Diversity Maximization under fairness/partition constraints. Given a set of points $P$ in a metric space partitioned into $m$ groups, and given $k_1,\ldots,k_m$, the goal is to pick $k_i$ points from each group $i$ such that the overall diversity of the $k=\sum_i k_i$ picked points is maximized. We consider two natural diversity measures, sum-of-pairwise distances and sum-of-nearest-neighbor distances, and show improved core-set construction algorithms with respect to these measures. First, we show the first constant-factor core-set w.r.t. sum-of-pairwise distances whose size is independent of the size of the dataset and the aspect ratio. Second, we show the first core-set w.r.t. sum-of-nearest-neighbor distances. Finally, we run several experiments showing the effectiveness of our core-set approach. In particular, we apply constrained diversity maximization to summarize a set of timed messages while taking the messages' recency into account. Specifically, the summary should include more recent messages than older ones. This is a real task in one of the largest communication platforms, affecting the experience of hundreds of millions of daily active users. By utilizing our core-set method for this task, we achieve a 100x speed-up while losing only a few percent of the diversity. Moreover, our approach allows us to improve the space usage of the algorithm in the streaming setting.
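To make the problem statement concrete, the following is a minimal sketch of the two diversity measures and of the partition constraint. It assumes points are coordinate tuples under the Euclidean metric, which is only one instance of the general metric setting. All function names are hypothetical, and the exhaustive search is an exponential-time reference for tiny inputs, not the core-set construction studied here.

```python
from itertools import combinations, product
from math import dist  # Euclidean distance; an assumed choice of metric


def sum_pairwise(points):
    """Sum of distances over all unordered pairs of the selected points."""
    return sum(dist(p, q) for p, q in combinations(points, 2))


def sum_nearest_neighbor(points):
    """Sum, over each selected point, of the distance to its nearest other selected point."""
    if len(points) < 2:
        return 0.0
    return sum(
        min(dist(p, q) for j, q in enumerate(points) if j != i)
        for i, p in enumerate(points)
    )


def brute_force_best(groups, quotas, diversity):
    """Pick quotas[i] points from groups[i] so that `diversity` is maximized.

    Exhaustive search over all feasible selections; illustration only.
    """
    per_group = [combinations(g, k) for g, k in zip(groups, quotas)]
    best_value, best_choice = float("-inf"), None
    for choice in product(*per_group):
        selected = [p for part in choice for p in part]
        value = diversity(selected)
        if value > best_value:
            best_value, best_choice = value, selected
    return best_value, best_choice


if __name__ == "__main__":
    # Two groups (m = 2) with quotas k_1 = 2 and k_2 = 1, so k = 3.
    groups = [[(0, 0), (1, 0), (10, 0)], [(0, 5), (0, 6)]]
    quotas = [2, 1]
    print(brute_force_best(groups, quotas, sum_pairwise))
    print(brute_force_best(groups, quotas, sum_nearest_neighbor))
```

A core-set would replace each group by a small subset before running a search of this kind, so that the optimum over the subset stays within a bounded factor of the optimum over the full input.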