We consider the problem of sampling a multimodal distribution with a Markov chain given a small number of samples from the stationary measure. Although mixing can be arbitrarily slow, we show that if the Markov chain has a $k$th order spectral gap, initialization from a set of $\tilde O(k/\varepsilon^2)$ samples from the stationary distribution will, with high probability over the samples, efficiently generate a sample whose conditional law is $\varepsilon$-close in TV distance to the stationary measure. In particular, this applies to mixtures of $k$ distributions satisfying a Poincar\'e inequality, with faster convergence when they satisfy a log-Sobolev inequality. Our bounds are stable to perturbations to the Markov chain, and in particular work for Langevin diffusion over $\mathbb R^d$ with score estimation error, as well as Glauber dynamics combined with approximation error from pseudolikelihood estimation. This justifies the success of data-based initialization for score matching methods despite slow mixing for the data distribution, and improves and generalizes the results of Koehler and Vuong (2023) to have linear, rather than exponential, dependence on $k$ and apply to arbitrary semigroups. As a consequence of our results, we show for the first time that a natural class of low-complexity Ising measures can be efficiently learned from samples.
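The phenomenon described above can be illustrated with a minimal toy sketch (not the paper's algorithm): for a well-separated mixture of two Gaussians, unadjusted Langevin dynamics mixes slowly between modes, but initializing chains at i.i.d. samples from the stationary measure preserves the mode proportions. All names and parameters below are illustrative assumptions, and the exact mixture score stands in for a learned score estimate.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy target: a well-separated mixture of k=2 unit-variance Gaussians on R
# with modes at ±5. Between-mode mixing of Langevin is exponentially slow.
modes = np.array([-5.0, 5.0])
weights = np.array([0.5, 0.5])

def score(x):
    # Exact score d/dx log p(x) of the mixture (stand-in for a score estimate).
    d = x[:, None] - modes[None, :]           # (n, k) deviations from each mode
    logw = np.log(weights) - 0.5 * d**2       # unnormalized log responsibilities
    r = np.exp(logw - logw.max(axis=1, keepdims=True))
    r /= r.sum(axis=1, keepdims=True)         # posterior mode responsibilities
    return -(r * d).sum(axis=1)

# Data-based initialization: start each chain at an i.i.d. sample from the
# stationary measure, then run unadjusted Langevin dynamics.
n, h, steps = 512, 0.01, 200
comp = rng.choice(2, size=n, p=weights)
x = modes[comp] + rng.standard_normal(n)      # samples from the mixture itself

for _ in range(steps):
    x = x + h * score(x) + np.sqrt(2 * h) * rng.standard_normal(n)

# Mode proportions are (approximately) preserved despite slow global mixing.
frac_right = (x > 0).mean()
print(round(frac_right, 2))
```

Each chain equilibrates quickly within its mode (the within-mode Poincaré inequality), and the empirical initialization supplies roughly the correct weight on each mode, which is the intuition behind the $\tilde O(k/\varepsilon^2)$ sample requirement.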