Optimal Transport has attracted considerable interest in recent years, in particular thanks to the Wasserstein distance, which provides a geometrically sensible and intuitive way of comparing probability measures. For computational reasons, the Sliced Wasserstein (SW) distance was introduced as an alternative to the Wasserstein distance, and has been used to train generative Neural Networks (NNs). While convergence of Stochastic Gradient Descent (SGD) has been observed in practice in such a setting, there is to our knowledge no theoretical guarantee for this observation. Leveraging recent work on the convergence of SGD on non-smooth and non-convex functions by Bianchi et al. (2022), we aim to bridge that knowledge gap and provide a realistic context under which fixed-step SGD trajectories for the SW loss on NN parameters converge. More precisely, we show that the trajectories approach the set of solutions of the (sub)gradient flow equations as the step size decreases. Under stricter assumptions, we show a much stronger convergence result for noised and projected SGD schemes, namely that the long-run limits of the trajectories approach a set of generalised critical points of the loss function.
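For readers unfamiliar with the SW distance, the following is a minimal sketch of how it is commonly estimated, not the scheme analysed in this work. It assumes two empirical measures with the same number of equally weighted points, the squared 2-Wasserstein cost on each one-dimensional projection, and Monte Carlo sampling of directions on the unit sphere. The names `random_direction`, `sliced_wasserstein_sq`, `n_projections` and `seed` are hypothetical and chosen for illustration only.

```python
import math
import random


def random_direction(dim, rng):
    """Draw a direction uniformly on the unit sphere of R^dim."""
    while True:
        v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        norm = math.sqrt(sum(c * c for c in v))
        if norm > 0.0:  # reject the (measure-zero) zero vector
            return [c / norm for c in v]


def sliced_wasserstein_sq(xs, ys, n_projections=200, seed=0):
    """Monte Carlo estimate of the squared SW distance (order 2).

    Assumption of this sketch: xs and ys are lists of points (lists of
    floats) of equal length, each point carrying the same mass.
    """
    if len(xs) != len(ys):
        raise ValueError("this sketch assumes equal sample sizes")
    rng = random.Random(seed)
    dim = len(xs[0])
    total = 0.0
    for _ in range(n_projections):
        theta = random_direction(dim, rng)
        # Project both point clouds onto the line spanned by theta.
        px = sorted(sum(t * c for t, c in zip(theta, x)) for x in xs)
        py = sorted(sum(t * c for t, c in zip(theta, y)) for y in ys)
        # In one dimension, optimal transport matches sorted samples.
        total += sum((a - b) ** 2 for a, b in zip(px, py)) / len(xs)
    return total / n_projections
```

Sorting is what makes each projected problem cheap: the one-dimensional transport cost has a closed form, which is the computational motivation for slicing. The sort also makes the loss non-smooth in the points, consistent with the non-smooth setting mentioned above.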