Efficient Uniform Sampling of Surjections via their Profiles

In this article, we develop efficient algorithms for sampling uniformly random surjections from $[n]$ to $[k]$, for all $n \geq k$. We make no assumption on how $n$ and $k$ relate. In particular, we do not make the common assumption that the ratio $\frac{n}{k}$ is constant, and all our guarantees are uniform in $n$ and $k$. Our first insight is that all the complexity of sampling a random surjection is captured by sampling a smaller structure, which we call the \emph{profile} of the surjection. More precisely, the profile associates to each preimage size $s$ that occurs the number of preimages of size $s$. Using standard techniques, we show that the problem of sampling surjections reduces to sampling the profile with the induced distribution. This is partly explained by the fact that profiles always have sublinear size, with at most $\sqrt{2n}$ entries in the worst case. We provide a complete set of algorithms that directly sample the profile of a random surjection with the induced distribution, covering the full parameter space. We show that these algorithms are optimal, up to logarithmic factors, in the expected size of the output. Our algorithms are based on exact-size Boltzmann samplers, which are standard rejection-based samplers. We partition the parameter space into three main regions and, in each region, optimize both the rejection rate and the cost of each sampling round. Profiles capture a number of relevant statistics of random surjections and might be of independent interest. In a related context, profiles of random mappings have recently been studied by Devroye et al. As a by-product, we answer an open question of Devroye and Los '25 by also providing an optimal algorithm for the profile of a random mapping when $k > n/\log n$. The results of this article are not only of theoretical interest but also lead to samplers that can be implemented in practice.
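To make the notion of profile concrete, the following Python sketch computes the profile of a surjection and, conversely, draws a surjection uniformly among those with a given profile. It is an illustration only: the function names `profile` and `surjection_from_profile` are hypothetical, the representation of a surjection as a list `f` with values in `range(k)` is an assumption, and the shuffle-based reconstruction is one plausible reading of the "standard techniques" mentioned above, not necessarily the procedure used in the article. It does not sample the profile itself, which is the hard part addressed by the Boltzmann samplers.

```python
import random
from collections import Counter


def profile(f, k):
    """Return {preimage size s: number of targets in range(k) with exactly s preimages}."""
    sizes = Counter(f)  # sizes[y] = number of x with f[x] == y
    if len(sizes) != k or any(not 0 <= y < k for y in sizes):
        raise ValueError("f is not a surjection onto range(k)")
    return dict(Counter(sizes.values()))


def surjection_from_profile(prof, rng=random):
    """Draw a surjection uniformly among those whose profile is `prof`.

    Assumed reconstruction (not taken from the article): shuffle the
    multiset of preimage sizes over the targets, then shuffle the domain
    and cut it into consecutive blocks of those sizes.
    """
    # One preimage size per target; shuffling makes every assignment of
    # sizes to targets equally likely.
    sizes = [s for s, count in prof.items() for _ in range(count)]
    rng.shuffle(sizes)
    n = sum(sizes)
    # A uniformly shuffled domain, cut into blocks, yields a uniform
    # ordered partition of the domain with the chosen block sizes.
    domain = list(range(n))
    rng.shuffle(domain)
    f = [None] * n
    start = 0
    for target, s in enumerate(sizes):
        for x in domain[start:start + s]:
            f[x] = target
        start += s
    return f
```

For example, `profile([0, 0, 1, 1, 2], 3)` has two preimages of size 2 and one of size 1. A profile with $m$ distinct sizes needs $n \geq 1 + 2 + \dots + m = m(m+1)/2$, which gives the $\sqrt{2n}$ bound on the number of entries. Under the assumptions above, the reconstruction step runs in time linear in $n$, so the remaining difficulty lies entirely in sampling the profile with the induced distribution.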