Optimal Dynamic Subset Sampling: Theory and Applications

We study subset sampling, the fundamental problem of sampling independent events. Specifically, consider a set of $n$ events $S=\{x_1, \ldots, x_n\}$, where each event $x_i$ has an associated probability $p(x_i)$. The subset sampling problem aims to sample a subset $T \subseteq S$ such that every $x_i$ is independently included in $T$ with probability $p(x_i)$. A naive solution is to flip a coin for each event, which takes $O(n)$ time. The goal, however, is to develop data structures that allow drawing a sample in time proportional to the expected output size $\mu=\sum_{i=1}^n p(x_i)$, which can be significantly smaller than $n$ in many applications. The subset sampling problem serves as an important building block in many tasks and has been studied for more than a decade. However, most existing subset sampling approaches assume a static setting, in which neither the events in $S$ nor their associated probabilities may change over time. In a dynamic setting, these algorithms incur either large query time or large update time, even though time-evolving events with changing probabilities are ubiquitous in real life. Designing efficient dynamic subset sampling algorithms is therefore a pressing need but remains an open problem. In this paper, we propose ODSS, the first optimal dynamic subset sampling algorithm. The expected query time and update time of ODSS are both optimal, matching the lower bounds of the subset sampling problem. We present a nontrivial theoretical analysis to demonstrate the superiority of ODSS. We also conduct comprehensive experiments to empirically evaluate the performance of ODSS. Moreover, we apply ODSS to a concrete application: influence maximization. We empirically show that ODSS can improve the complexities of existing influence maximization algorithms on large real-world evolving social networks.
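To make the problem statement concrete, the following Python sketch contrasts the naive $O(n)$ coin-flipping baseline with an output-sensitive sampler for one easy special case. It is not ODSS and makes no claim about how ODSS works internally. The function names (`expected_size`, `naive_subset_sample`, `geometric_skip_sample`) are hypothetical, and the second sampler assumes that every event has the same probability $p$, in which case the gap between consecutive sampled indices follows a geometric distribution and can be drawn directly.

```python
import math
import random


def expected_size(probs):
    """Expected output size mu = sum of p(x_i)."""
    return sum(probs)


def naive_subset_sample(probs, rng):
    """Naive baseline: one coin flip per event, O(n) time per sample."""
    return [i for i, p in enumerate(probs) if rng.random() < p]


def geometric_skip_sample(n, p, rng):
    """Sample from n events that ALL share the same probability p.

    Assumption (not from the paper): uniform probabilities. The number of
    skipped events before the next included one is geometric, so we jump
    directly to it. Expected time is O(1 + n * p) instead of O(n).
    """
    if p <= 0.0:
        return []
    if p >= 1.0:
        return list(range(n))
    log_q = math.log1p(-p)  # log(1 - p), strictly negative here
    sample = []
    i = -1
    while True:
        u = 1.0 - rng.random()  # uniform in (0, 1], avoids log(0)
        i += 1 + int(math.log(u) / log_q)  # skip a geometric number of events
        if i >= n:
            return sample
        sample.append(i)


if __name__ == "__main__":
    rng = random.Random(0)
    probs = [0.01] * 1000
    print("mu =", expected_size(probs))
    print("naive:", naive_subset_sample(probs, rng))
    print("skip: ", geometric_skip_sample(len(probs), 0.01, rng))
```

Both samplers return indices of the included events. The geometric-skip trick only covers the uniform, static case; handling arbitrary probabilities together with insertions, deletions, and probability updates is the problem the abstract describes.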
