This article studies the fundamental problem of using i.i.d. coin tosses from an entropy source to efficiently generate random variables $X_i \sim P_i$ $(i \ge 1)$, where $(P_1, P_2, \dots)$ is a random sequence of rational discrete probability distributions subject to an \textit{arbitrary} stochastic process. Our method achieves an amortized expected entropy cost within $\varepsilon > 0$ bits of the information-theoretically optimal Shannon lower bound using $O(\log(1/\varepsilon))$ space. This result holds both pointwise in terms of the Shannon information content conditioned on $X_i$ and $P_i$, and in expectation to obtain a rate of $\mathbb{E}[H(P_1) + \dots + H(P_n)]/n + \varepsilon$ bits per sample as $n \to \infty$ (where $H$ is the Shannon entropy). The combination of space, time, and entropy properties of our method improves upon the Knuth and Yao (1976) entropy-optimal algorithm and Han and Hoshi (1997) interval algorithm for online sampling, which require unbounded space. It also uses exponentially less space than the more specialized methods of Kozen and Soloviev (2022) and Shao and Wang (2025) that generate i.i.d. samples from a fixed distribution. Our online sampling algorithm rests on a powerful algorithmic technique called \textit{randomness recycling}, which reuses a fraction of the random information consumed by a probabilistic algorithm to reduce its amortized entropy cost. On the practical side, we develop randomness recycling techniques to accelerate a variety of prominent sampling algorithms. We show that randomness recycling enables state-of-the-art runtime performance on the Fisher-Yates shuffle when using a cryptographically secure pseudorandom number generator, and that it reduces the entropy cost of discrete Gaussian sampling. Accompanying the manuscript is a performant software library in the C programming language.