This paper studies the \emph{subset sampling} problem. The input is a set $\mathcal{S}$ of $n$ records together with a function $\textbf{p}$ that assigns each record $v\in\mathcal{S}$ a probability $\textbf{p}(v)$. A query returns a random subset $X$ of $\mathcal{S}$, where each record $v\in\mathcal{S}$ is sampled into $X$ independently with probability $\textbf{p}(v)$. The goal is to store $\mathcal{S}$ in a data structure that answers queries efficiently. If $\mathcal{S}$ fits in memory, the interesting case is when $\mathcal{S}$ is dynamic. We develop a dynamic data structure with $\mathcal{O}(1+\mu_{\mathcal{S}})$ expected \emph{query} time, $\mathcal{O}(n)$ space, and $\mathcal{O}(1)$ amortized expected \emph{update}, \emph{insert}, and \emph{delete} time, where $\mu_{\mathcal{S}}=\sum_{v\in\mathcal{S}}\textbf{p}(v)$ is the expected size of the sample. The query time and space are optimal. If $\mathcal{S}$ does not fit in memory, the problem is difficult even if $\mathcal{S}$ is static. In this setting, we present an I/O-efficient algorithm that answers a \emph{query} in $\mathcal{O}\left((\log^*_B n)/B+(\mu_\mathcal{S}/B)\log_{M/B} (n/B)\right)$ amortized expected I/Os using $\mathcal{O}(n/B)$ space, where $M$ is the memory size, $B$ is the block size, and $\log^*_B n$ is the number of times $\log_2(\cdot)$ must be applied iteratively to $n$ before the result drops below $B$. In addition, when each record is associated with a real-valued key, we extend the \emph{subset sampling} problem to the \emph{range subset sampling} problem, in which the keys of the sampled records must fall within a specified input range $[a,b]$. For this extension, we provide a solution in the dynamic setting with $\mathcal{O}(\log n+\mu_{\mathcal{S}\cap[a,b]})$ expected \emph{query} time, $\mathcal{O}(n)$ space, and $\mathcal{O}(\log n)$ amortized expected \emph{update}, \emph{insert}, and \emph{delete} time.
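To make the query semantics concrete, the following Python sketch contrasts the naive sampler, which flips one coin per record and therefore costs $\mathcal{O}(n)$ per query, with a simple static bucketing scheme that uses geometric skips. This is not the data structure developed in the paper: it is a standard textbook-style illustration, it does not support updates, and its query cost is proportional to the number of non-empty buckets plus the expected output size rather than $\mathcal{O}(1+\mu_{\mathcal{S}})$. The function names (`naive_subset_sample`, `build_buckets`, `bucketed_subset_sample`) and the choice to identify records by their index in a list `probs` are assumptions made only for this example.

```python
import math
import random


def naive_subset_sample(probs, rng):
    """Baseline: one independent coin flip per record, O(n) per query."""
    return [v for v, p in enumerate(probs) if rng.random() < p]


def build_buckets(probs):
    """Group record indices by the binary exponent of their probability.

    math.frexp(p) returns (m, e) with p = m * 2**e and 0.5 <= m < 1, so every
    record in bucket e has p in [2**(e-1), 2**e). The bucket's upper bound is
    capped at 1.0 (this only matters for p == 1.0).
    Assumes every probability lies in [0, 1]; records with p == 0 are skipped.
    """
    groups = {}
    for v, p in enumerate(probs):
        if p <= 0.0:
            continue
        _, e = math.frexp(p)
        groups.setdefault(e, []).append(v)
    return {e: (min(1.0, math.ldexp(1.0, e)), items) for e, items in groups.items()}


def geometric_skip_sample(items, p_max, probs, rng):
    """Sample each v in items with probability probs[v], given probs[v] <= p_max.

    Candidates are chosen at rate p_max by jumping over a geometrically
    distributed number of items; each candidate v is then accepted with
    probability probs[v] / p_max (rejection step).
    """
    if p_max >= 1.0:
        # No skipping is possible at rate 1; fall back to per-item coin flips.
        return [v for v in items if rng.random() < probs[v]]
    out = []
    log_q = math.log1p(-p_max)  # log(1 - p_max), strictly negative here
    i = -1
    while True:
        u = 1.0 - rng.random()  # uniform in (0, 1], avoids log(0)
        i += 1 + int(math.log(u) / log_q)  # geometric number of skipped items
        if i >= len(items):
            return out
        v = items[i]
        if rng.random() < probs[v] / p_max:
            out.append(v)


def bucketed_subset_sample(buckets, probs, rng):
    """Answer one query by sampling every bucket independently."""
    out = []
    for p_max, items in buckets.values():
        out.extend(geometric_skip_sample(items, p_max, probs, rng))
    return out


if __name__ == "__main__":
    rng = random.Random(0)
    probs = [0.9, 0.5, 0.1, 0.0, 1.0, 0.03]
    buckets = build_buckets(probs)
    print("mu =", sum(probs))
    print("one sample:", sorted(bucketed_subset_sample(buckets, probs, rng)))
```

Within a bucket, the expected number of candidates is at most twice the sum of the probabilities in that bucket, because every record's probability is at least half of the bucket's upper bound. This is why the work spent on sampling scales with $\mu_{\mathcal{S}}$ rather than with $n$; the remaining overhead in this sketch comes from visiting each non-empty bucket.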