Weighted Set Multi-Cover on Bounded Universe and Applications in Package Recommendation

The weighted set multi-cover problem is a fundamental generalization of set cover that arises in data-driven applications where one must select a small, low-cost subset from a large collection of candidates under coverage constraints. In data management settings, such problems appear naturally either as expressive database queries or as post-processing steps over query results, for example when selecting representative or diverse subsets from large relations returned by database queries for decision support, recommendation, fairness-aware data selection, or crowd-sourcing. While the general weighted set multi-cover problem is NP-hard, many practical workloads involve a \emph{bounded universe} of items that must be covered, leading to the Weighted Set Multi-Cover with Bounded Universe (WSMC-BU) problem, where the universe size is constant. In this paper, we develop exact and approximation algorithms for WSMC-BU. We first discuss a dynamic programming algorithm that solves WSMC-BU exactly in $O(n^{\ell+1})$ time, where $n$ is the number of input sets and $\ell=O(1)$ is the universe size. We then present a $2$-approximation algorithm based on linear programming and rounding, running in $O(\mathcal{L}(n))$ time, where $\mathcal{L}(n)$ denotes the complexity of solving a linear program with $O(n)$ variables. To further improve efficiency on large datasets, we propose a faster $(2+\varepsilon)$-approximation algorithm with running time $O(n \log n + \mathcal{L}(\log W))$, where $W$ is the ratio of the total weight to the minimum weight and $\varepsilon > 0$ is an arbitrary constant specified by the user. Extensive experiments on real and synthetic datasets demonstrate that our methods consistently outperform greedy and standard LP-rounding baselines in both solution quality and runtime, making them suitable for data-intensive selection tasks over large query outputs.
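
To make the exact approach concrete, the following is a minimal sketch of one possible dynamic program over residual demand vectors. It is not necessarily the algorithm developed in the paper. It assumes a formalization the abstract does not spell out: each universe element $e$ has an integer demand, each input set may be chosen at most once, and the goal is a minimum-weight sub-collection that covers every element at least as many times as its demand. The names `wsmc_bu_exact`, `demands`, `sets`, and `weights` are hypothetical.

```python
from math import inf


def wsmc_bu_exact(demands, sets, weights):
    """Return the minimum total weight of a sub-collection of `sets` that
    covers each element e at least demands[e] times, or inf if infeasible.

    Assumed formalization (not taken from the paper): elements are the
    integers 0..l-1, each set is used at most once, weights are non-negative.
    """
    # State: tuple of residual demands; value: cheapest cost reaching it.
    states = {tuple(demands): 0.0}
    for members, w in zip(sets, weights):
        nxt = dict(states)  # option 1: skip this set
        for residual, cost in states.items():
            # option 2: take this set, lowering the residual demand of
            # every element it contains (never below zero)
            reduced = tuple(
                max(0, r - 1) if e in members else r
                for e, r in enumerate(residual)
            )
            if cost + w < nxt.get(reduced, inf):
                nxt[reduced] = cost + w
        states = nxt
    # All demands satisfied corresponds to the all-zero residual vector.
    return states.get((0,) * len(demands), inf)


if __name__ == "__main__":
    # Universe {0, 1}; element 0 must be covered twice, element 1 once.
    demands = [2, 1]
    sets = [{0}, {0, 1}, {0}, {1}]
    weights = [2, 5, 4, 2]
    print(wsmc_bu_exact(demands, sets, weights))
```

Under these assumptions every residual demand lies between $0$ and $n$, so there are at most $(n+1)^{\ell}$ states, and processing $n$ sets gives $O(n^{\ell+1})$ state updates for constant $\ell$, which matches the bound stated above.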
