Optimal Dynamic Subset Sampling: Theory and Applications

We study the fundamental problem of sampling independent events, called subset sampling. Specifically, consider a set of $n$ events $S=\{x_1, \ldots, x_n\}$, where each event $x_i$ has an associated probability $p(x_i)$. The subset sampling problem aims to sample a subset $T \subseteq S$ such that every $x_i$ is independently included in $T$ with probability $p(x_i)$. A naive solution is to flip a coin for each event, which takes $O(n)$ time. The goal, however, is to develop data structures that allow drawing a sample in time proportional to the expected output size $\mu=\sum_{i=1}^n p(x_i)$, which can be significantly smaller than $n$ in many applications. The subset sampling problem serves as an important building block in many tasks and has been studied for more than a decade. However, most existing subset sampling approaches assume a static setting, where neither the events in $S$ nor their associated probabilities may change over time. In a dynamic setting, these algorithms incur either large query time or large update time, even though time-evolving events with changing probabilities are ubiquitous in real life. Designing efficient dynamic subset sampling algorithms is therefore a pressing need that remains an open problem. In this paper, we propose ODSS, the first optimal dynamic subset sampling algorithm. The expected query time and update time of ODSS are both optimal, matching the lower bounds of the subset sampling problem. We present a nontrivial theoretical analysis to demonstrate the superiority of ODSS. We also conduct comprehensive experiments to empirically evaluate the performance of ODSS. Moreover, we apply ODSS to a concrete application: influence maximization. We empirically show that ODSS can improve the complexities of existing influence maximization algorithms on large real-world evolving social networks.
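
For concreteness, the naive $O(n)$ baseline mentioned above can be sketched as follows. This is only an illustration of the problem definition, not ODSS itself; the function names `naive_subset_sample` and `expected_size` are hypothetical and do not come from the paper.

```python
import random


def naive_subset_sample(probs, rng=random):
    """Naive baseline: flip one independent coin per event.

    probs[i] plays the role of p(x_i). Returns the indices of the
    sampled subset T. Always costs O(n) time, regardless of |T|.
    """
    # rng.random() is uniform on [0, 1), so the test succeeds with
    # probability exactly probs[i].
    return [i for i, p in enumerate(probs) if rng.random() < p]


def expected_size(probs):
    """Expected output size mu = sum_i p(x_i)."""
    return sum(probs)


if __name__ == "__main__":
    rng = random.Random(0)            # fixed seed, for reproducibility only
    probs = [0.01] * 1000             # n = 1000 events, mu = 10
    sample = naive_subset_sample(probs, rng)
    print(len(probs), expected_size(probs), len(sample))
```

In this example $n = 1000$ while $\mu = 10$, so the naive method inspects about a hundred times more events than it outputs on average, which is the gap that an output-sensitive data structure is meant to close.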
