We study the maximum set coverage problem in the massively parallel model. In this setting, $m$ sets, each a subset of a universe of $n$ elements, are distributed among $m$ machines. In each round, the machines may communicate with one another, subject to the constraint that no machine uses more than $\tilde{O}(n)$ memory. The objective is to find $k$ sets whose union covers as many elements as possible. We consider the regime where $k = \Omega(m)$ and $m = O(n)$. Maximum coverage is a special case of submodular maximization subject to a cardinality constraint. The greedy algorithm approximates this problem to within a factor of $1-1/e$, but it is not directly applicable to parallel and distributed models. When $k = \Omega(m)$, previous work that obtains a $(1-1/e-\epsilon)$-approximation requires either $\tilde{O}(mn)$ memory per machine, which offers no advantage over the trivial algorithm that sends the entire input to a single machine, or $2^{O(1/\epsilon)} n$ memory per machine, which is prohibitively expensive even for moderately small values of $\epsilon$. Our result is a randomized $(1-1/e-\epsilon)$-approximation algorithm that uses $O(1/\epsilon^3 \cdot \log m \cdot (\log (1/\epsilon) + \log m))$ rounds. The algorithm solves a slightly transformed linear program for the maximum coverage problem using the multiplicative weights update method, combined with classic parallel computing techniques such as parallel prefix and various combinatorial arguments.
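For reference, the sequential greedy baseline mentioned above can be sketched as follows. This is only an illustration of the classic single-machine greedy rule, not the parallel algorithm of this work; the function name `greedy_max_coverage` and the dictionary-based input format are assumptions made for the example.

```python
from typing import Dict, Hashable, List, Set


def greedy_max_coverage(sets: Dict[Hashable, Set[int]], k: int) -> List[Hashable]:
    """Pick up to k sets, each time taking the set with the largest marginal gain.

    Hypothetical helper for illustration: `sets` maps a set name to its elements.
    """
    covered: Set[int] = set()          # elements covered so far
    chosen: List[Hashable] = []        # names of the selected sets, in order
    remaining = dict(sets)             # shallow copy so the input is not mutated

    for _ in range(min(k, len(sets))):
        # Marginal gain of a set = number of its elements not yet covered.
        best = max(remaining, key=lambda name: len(remaining[name] - covered))
        if not remaining[best] - covered:
            break                      # no remaining set adds new elements
        chosen.append(best)
        covered |= remaining.pop(best)

    return chosen


if __name__ == "__main__":
    example = {"A": {1, 2, 3}, "B": {3, 4}, "C": {4, 5, 6, 7}, "D": {1, 7}}
    print(greedy_max_coverage(example, k=2))
```

Each iteration depends on the elements covered by all earlier choices, so the loop runs up to $k$ sequential steps over the whole input. This dependency is one way to see why the greedy rule does not carry over directly to a setting with $\tilde{O}(n)$ memory per machine and few rounds.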