Given a collection of $m$ sets from a universe $\mathcal{U}$, the Maximum Set Coverage problem consists of finding $k$ sets whose union has the largest cardinality. This problem is NP-hard, but a polynomial-time algorithm approximates the optimal solution within a factor of $1-1/e$. However, this algorithm does not scale well with the input size. In the streaming setting, practical algorithms find high-quality solutions, but their space complexity scales linearly with the size of the universe $|\mathcal{U}|$. One randomized streaming algorithm, however, has been shown to produce a $1-1/e-\varepsilon$ approximation of the optimal solution with a space complexity that scales only poly-logarithmically with $m$ and $|\mathcal{U}|$. To achieve such a low space complexity, its authors used a technique called subsampling, based on hash functions of bounded independence, together with $F_0$-sketching. This article focuses on this sublinear-space algorithm and introduces methods to reduce the time cost of subsampling. First, we give optimizations that do not alter the space complexity, number of passes, or approximation quality of the original algorithm. In particular, we reanalyze the error bounds to show that the original independence factor of $\Omega(\varepsilon^{-2} k \log m)$ can be reduced to $\Omega(k \log m)$. Second, we show that $F_0$-sketching can be replaced by a much simpler mechanism. Finally, our experimental results show that even a sampler based on a pairwise-independent hash function does not produce worse solutions than the original algorithm, while running several orders of magnitude faster.
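To make the two ingredients named in the abstract concrete, the following is a minimal offline sketch, not the streaming algorithm studied in the article. It assumes universe elements are non-negative integers smaller than the modulus, and it pairs a pairwise-independent hash sampler with the classical greedy rule that achieves the $1-1/e$ factor. All names (`make_pairwise_hash`, `subsample`, `greedy_max_coverage`, `MODULUS`) are hypothetical and chosen for illustration only.

```python
import random

# Prime modulus for the hash family; assumed larger than every element id.
MODULUS = (1 << 61) - 1


def make_pairwise_hash(rng):
    """Draw h(x) = (a*x + b) mod p with a, b uniform in Z_p.

    This family is pairwise independent over Z_p.
    """
    a = rng.randrange(MODULUS)
    b = rng.randrange(MODULUS)
    return lambda x: (a * x + b) % MODULUS


def subsample(sets, rate, rng):
    """Keep each element whose hash falls below rate * p.

    The same hash is applied to every set, so an element is either
    kept everywhere or dropped everywhere.
    """
    h = make_pairwise_hash(rng)
    threshold = int(rate * MODULUS)
    return [{x for x in s if h(x) < threshold} for s in sets]


def greedy_max_coverage(sets, k):
    """Pick up to k sets, each time taking the largest marginal gain."""
    covered = set()
    chosen = []
    for _ in range(min(k, len(sets))):
        best = max(range(len(sets)), key=lambda i: len(sets[i] - covered))
        if not sets[best] - covered:
            break  # no remaining set adds a new element
        chosen.append(best)
        covered |= sets[best]
    return chosen, covered


if __name__ == "__main__":
    rng = random.Random(0)
    family = [set(rng.sample(range(1000), 50)) for _ in range(40)]
    sampled = subsample(family, 0.25, rng)
    chosen, _ = greedy_max_coverage(sampled, 5)
    # Evaluate the sets chosen on the sample against the full universe.
    full_cover = set().union(*(family[i] for i in chosen))
    print(chosen, len(full_cover))
```

The idea illustrated here is that the selection is made on the subsampled sets, which are much smaller, and the chosen indices are then evaluated on the original sets. The sampling rate of 0.25 is an arbitrary value for the demonstration and is not taken from the article.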