Given a collection of $m$ sets from a universe $\mathcal{U}$, the Maximum Set Coverage problem consists of finding $k$ sets whose union has the largest cardinality. This problem is NP-hard, but a polynomial-time algorithm approximates the optimal solution up to a factor of $1-1/e$. However, this algorithm does not scale well with the input size. In a streaming context, practical high-quality solutions exist, but their space complexity scales linearly with the size of the universe $|\mathcal{U}|$. One randomized streaming algorithm, however, has been shown to produce a $1-1/e-\varepsilon$ approximation of the optimal solution with a space complexity that scales only poly-logarithmically with $m$ and $|\mathcal{U}|$. To achieve such a low space complexity, its authors used a technique called subsampling, based on hash functions with limited independence. This article focuses on this sublinear-space algorithm and introduces methods to reduce the time cost of subsampling. First, we show how to accelerate the algorithm by several orders of magnitude without altering its space complexity, number of passes, or approximation quality. Second, we derive a new lower bound on the probability of producing a $1-1/e-\varepsilon$ approximation using only pairwise independence: $1-\tfrac{4}{c k \log m}$, compared to the original $1-\tfrac{2e}{m^{ck/6}}$. Although this theoretical guarantee is weaker, our algorithm performs well in practice on large streams and presents the best time-space-performance trade-off for maximum coverage in streams.
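To make the two ingredients named in the abstract concrete, the following is a minimal Python sketch, not the streaming algorithm studied in the article. It assumes universe elements are non-negative integers smaller than a fixed prime, uses the textbook pairwise-independent family $h(x) = (ax + b) \bmod p$ to subsample the universe, and then runs the classical offline greedy procedure (the $1-1/e$ approximation) on the subsampled sets. All names (`MERSENNE_PRIME`, `make_pairwise_hash`, `subsample`, `greedy_max_coverage`) and the choice of prime are assumptions made for illustration.

```python
import random

# Prime modulus for the hash family; this particular choice is an assumption of the sketch.
MERSENNE_PRIME = (1 << 61) - 1


def make_pairwise_hash(rng):
    """Draw h(x) = (a*x + b) mod p, a pairwise-independent family over integers in [0, p)."""
    a = rng.randrange(MERSENNE_PRIME)
    b = rng.randrange(MERSENNE_PRIME)

    def h(x):
        return (a * x + b) % MERSENNE_PRIME

    return h


def subsample(sets, h, rate):
    """Keep only the elements whose hash falls below rate * p.

    The same hash is applied to every set, so an element is either kept
    in all sets that contain it or dropped from all of them.
    """
    threshold = int(rate * MERSENNE_PRIME)
    return [{x for x in s if h(x) < threshold} for s in sets]


def greedy_max_coverage(sets, k):
    """Classical greedy: repeatedly pick the set covering the most uncovered elements."""
    covered = set()
    chosen = []
    for _ in range(min(k, len(sets))):
        best = max(range(len(sets)), key=lambda i: len(sets[i] - covered))
        if not sets[best] - covered:
            break  # no remaining set adds new elements
        chosen.append(best)
        covered |= sets[best]
    return chosen, covered


if __name__ == "__main__":
    rng = random.Random(0)
    sets = [set(rng.sample(range(10_000), 500)) for _ in range(50)]
    h = make_pairwise_hash(rng)
    sampled = subsample(sets, h, rate=0.1)
    chosen, _ = greedy_max_coverage(sampled, k=5)
    # Evaluate the sets chosen on the subsample against the full universe.
    full_cover = set().union(*(sets[i] for i in chosen))
    print(chosen, len(full_cover))
```

In this sketch the selection is made on the subsample only, and the chosen indices are then evaluated on the original sets. The sampling rate of 0.1 is arbitrary; how the rate should be set to obtain a $1-1/e-\varepsilon$ guarantee is the subject of the analysis in the article and is not reproduced here.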