Optimal Dynamic Subset Sampling: Theory and Applications

We study the fundamental problem of sampling independent events, called subset sampling. Specifically, consider a set of $n$ events $S=\{x_1, \ldots, x_n\}$, where each event $x_i$ has an associated probability $p(x_i)$. The subset sampling problem aims to sample a subset $T \subseteq S$ such that every $x_i$ is independently included in $T$ with probability $p(x_i)$. A naive solution is to flip a coin for each event, which takes $O(n)$ time. The goal, however, is to develop data structures that allow drawing a sample in time proportional to the expected output size $\mu=\sum_{i=1}^n p(x_i)$, which can be significantly smaller than $n$ in many applications. The subset sampling problem serves as an important building block in many tasks and has been studied for more than a decade. However, most existing subset sampling approaches assume a static setting, in which neither the events in $S$ nor their associated probabilities may change over time. In a dynamic setting, these algorithms incur either a large query time or a large update time, even though events whose probabilities evolve over time are ubiquitous in practice. Designing efficient dynamic subset sampling algorithms is therefore a pressing but still open problem. In this paper, we propose ODSS, the first optimal dynamic subset sampling algorithm. The expected query time and update time of ODSS are both optimal, matching the lower bounds of the subset sampling problem. We present a nontrivial theoretical analysis to demonstrate the superiority of ODSS. We also conduct comprehensive experiments to empirically evaluate the performance of ODSS. Moreover, we apply ODSS to a concrete application: influence maximization. We empirically show that ODSS can improve the complexities of existing influence maximization algorithms on large real-world evolving social networks.
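To make the problem statement concrete, the following Python sketch contrasts the naive $O(n)$ coin-flipping sampler with a sampler whose expected running time is $O(1+\mu)$. The second routine is not ODSS: it is the standard geometric-skip technique, and it only handles the special case in which every event has the same probability $p$, so that $\mu = np$. The function names (`naive_subset_sample`, `expected_size`, `uniform_subset_sample`) are hypothetical and chosen for illustration only.

```python
import math
import random


def naive_subset_sample(probs, rng):
    """Flip one independent coin per event: O(n) time regardless of output size."""
    return [i for i, p in enumerate(probs) if rng.random() < p]


def expected_size(probs):
    """Expected output size mu = sum of all inclusion probabilities."""
    return sum(probs)


def uniform_subset_sample(n, p, rng):
    """Subset sampling when every event has the same probability p.

    Assumption for this sketch: equal probabilities. Instead of flipping
    n coins, jump directly to the next included index by drawing the
    number of consecutive failures from a geometric distribution.
    Expected time is O(1 + n * p).
    """
    if p <= 0.0:
        return []
    if p >= 1.0:
        return list(range(n))
    log_q = math.log1p(-p)  # log(1 - p), strictly negative here
    sample = []
    i = -1
    while True:
        u = 1.0 - rng.random()  # uniform on (0, 1], so log(u) is finite
        failures = int(math.log(u) / log_q)  # failures before next success
        i += failures + 1
        if i >= n:
            return sample
        sample.append(i)


if __name__ == "__main__":
    rng = random.Random(0)
    probs = [0.01] * 1000
    print("expected size:", expected_size(probs))
    print("naive sample size:", len(naive_subset_sample(probs, rng)))
    print("skip sample size:", len(uniform_subset_sample(1000, 0.01, rng)))
```

Both samplers return indices of included events in increasing order. Handling arbitrary, changing probabilities within the same time bound is the harder problem that the paper addresses, and this sketch does not attempt it.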
