Optimal Dynamic Subset Sampling: Theory and Applications

We study the fundamental problem of sampling independent events, called subset sampling. Specifically, consider a set of $n$ events $S=\{x_1, \ldots, x_n\}$, where each event $x_i$ has an associated probability $p(x_i)$. The subset sampling problem aims to sample a subset $T \subseteq S$ such that every $x_i$ is independently included in $T$ with probability $p(x_i)$. A naive solution is to flip a coin for each event, which takes $O(n)$ time. The goal, however, is to develop data structures that allow drawing a sample in time proportional to the expected output size $\mu=\sum_{i=1}^n p(x_i)$, which can be significantly smaller than $n$ in many applications. The subset sampling problem serves as an important building block in many tasks and has been studied for more than a decade. However, most existing subset sampling approaches are designed for a static setting, where the events in $S$ and their associated probabilities are not allowed to change over time. In a dynamic setting, these algorithms incur either large query time or large update time, even though time-evolving events with changing probabilities are ubiquitous in real life. Designing efficient dynamic subset sampling algorithms is therefore a pressing need, yet it remains an open problem. In this paper, we propose ODSS, the first optimal dynamic subset sampling algorithm. The expected query time and update time of ODSS are both optimal, matching the lower bounds of the subset sampling problem. We present a nontrivial theoretical analysis to demonstrate the superiority of ODSS. We also conduct comprehensive experiments to empirically evaluate the performance of ODSS. Moreover, we apply ODSS to a concrete application: influence maximization. We empirically show that ODSS can improve the complexities of existing influence maximization algorithms on large real-world evolving social networks.
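To make the gap between $O(n)$ and $O(1+\mu)$ concrete, the following Python sketch contrasts the naive coin-flipping baseline with a standard geometric-skip sampler. This is not ODSS: the skip sampler only handles the special case in which all events share one probability $p$ (so $\mu = np$), and it is static. The function names `naive_subset_sample` and `geometric_skip_sample` are hypothetical and chosen for illustration only.

```python
import math
import random


def naive_subset_sample(probs, rng):
    """Flip one biased coin per event; always O(n) time."""
    return [i for i, p in enumerate(probs) if rng.random() < p]


def geometric_skip_sample(n, p, rng):
    """Sample indices from range(n), each included independently with the
    same probability p, in expected O(1 + n*p) time.

    Illustrative assumption: all events share one probability. Instead of
    testing every index, jump straight to the next included index by
    drawing the number of skipped indices from a geometric distribution.
    """
    if p <= 0.0:
        return []
    if p >= 1.0:
        return list(range(n))
    log_q = math.log1p(-p)  # log(1 - p), strictly negative here
    sampled = []
    i = -1  # index of the most recently included event
    while True:
        u = 1.0 - rng.random()  # uniform on (0, 1], so log(u) is defined
        # floor(log(u) / log(1 - p)) is the number of indices skipped
        i += 1 + int(math.log(u) / log_q)
        if i >= n:
            return sampled
        sampled.append(i)


if __name__ == "__main__":
    rng = random.Random(42)
    n, p = 1_000_000, 1e-4  # expected output size mu = n * p = 100
    print(len(naive_subset_sample([p] * n, rng)))  # touches all n events
    print(len(geometric_skip_sample(n, p, rng)))   # touches about mu events
```

Both functions draw from the same distribution when every probability equals $p$; the difference is only in how many events are touched. Handling arbitrary and changing probabilities with the same output-sensitive cost is the harder problem that the abstract addresses.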
