Non-convex sampling is a key challenge in machine learning, central to non-convex optimization in deep learning as well as to approximate probabilistic inference. Despite its significance, many important theoretical challenges remain: existing guarantees (1) typically hold only for the averaged iterates rather than the more desirable last iterates, (2) lack convergence metrics, such as Wasserstein distances, that capture the scales of the variables, and (3) mainly apply to elementary schemes such as stochastic gradient Langevin dynamics. In this paper, we develop a new framework that addresses the above issues by harnessing several tools from the theory of dynamical systems. Our key result is that, for a large class of state-of-the-art sampling schemes, last-iterate convergence in Wasserstein distances can be reduced to the study of their continuous-time counterparts, which are much better understood. Coupled with standard assumptions of MCMC sampling, our theory immediately yields the last-iterate Wasserstein convergence of many advanced sampling schemes such as proximal, randomized mid-point, and Runge-Kutta integrators. Beyond existing methods, our framework also motivates more efficient schemes that enjoy the same rigorous guarantees.
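To make the notion of last-iterate Wasserstein convergence concrete, the following minimal sketch runs a plain (full-gradient) Langevin discretization on a one-dimensional non-convex target and measures the Wasserstein-1 distance between the last iterates of many parallel chains and the target. It is a toy illustration only, not the framework developed in the paper: the double-well potential U(x) = (x^2 - 1)^2, the step size, the number of chains, and all function names are assumptions chosen for the example.

```python
import numpy as np


def grad_potential(x):
    # Gradient of the assumed toy potential U(x) = (x^2 - 1)^2 (non-convex, two wells).
    return 4.0 * x * (x**2 - 1.0)


def target_quantiles(n, lo=-3.0, hi=3.0, grid_size=20001):
    # Approximate n evenly spaced quantiles of the density proportional to exp(-U)
    # by numerically inverting its CDF on a fine grid.
    grid = np.linspace(lo, hi, grid_size)
    density = np.exp(-((grid**2 - 1.0) ** 2))
    cdf = np.cumsum(density)
    cdf /= cdf[-1]
    levels = (np.arange(n) + 0.5) / n
    return np.interp(levels, cdf, grid)


def wasserstein1(samples, quantiles):
    # In one dimension, W1 between two equal-size empirical measures is the
    # mean absolute difference of their sorted values.
    return float(np.mean(np.abs(np.sort(samples) - quantiles)))


def run_langevin(n_chains=4000, n_steps=2000, step=0.01, seed=0):
    rng = np.random.default_rng(seed)
    x = np.zeros(n_chains)  # every chain starts at the barrier top, x = 0
    ref = target_quantiles(n_chains)
    initial = wasserstein1(x, ref)
    for _ in range(n_steps):
        noise = rng.standard_normal(n_chains)
        # Euler-Maruyama step of the Langevin diffusion dX = -grad U(X) dt + sqrt(2) dW.
        x = x - step * grad_potential(x) + np.sqrt(2.0 * step) * noise
    # Only the last iterate of each chain is compared with the target.
    return initial, wasserstein1(x, ref)


initial_w1, final_w1 = run_langevin()
print(f"W1 at initialization: {initial_w1:.3f}")
print(f"W1 of last iterates:  {final_w1:.3f}")
```

The distance is computed from the final state of each chain rather than from a running average, which is the quantity the guarantees discussed above concern. Replacing the update line with a different integrator (for example a randomized mid-point step) would allow the same comparison for other schemes.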