To obtain optimal first-order convergence guarantees for stochastic optimization, the data sampling algorithm must be recurrent, that is, it must sample every data point with sufficient frequency. Most commonly used data sampling algorithms (e.g., i.i.d., MCMC, random reshuffling) are indeed recurrent under mild assumptions. In this work, we show that for a particular class of stochastic optimization algorithms, no property of the data sampling algorithm beyond recurrence (e.g., independence, exponential mixing, or reshuffling) is needed to guarantee the optimal rate of first-order convergence. Namely, using regularized versions of Minimization by Incremental Surrogate Optimization (MISO), we show that for non-convex and possibly non-smooth objective functions, the expected optimality gap converges at the optimal rate $O(n^{-1/2})$ under general recurrent sampling schemes. Furthermore, the implied constant depends explicitly on the `speed of recurrence', measured by the expected time to visit a given data point, either averaged (`target time') or taken in supremum (`hitting time') over the current location. We demonstrate theoretically and empirically that convergence can be accelerated by selecting sampling algorithms that cover the data set most effectively. We discuss applications of our general framework to decentralized optimization and distributed non-negative matrix factorization.
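To make the `speed of recurrence' concrete, the following is a minimal sketch, not the paper's implementation, that computes expected hitting times for two samplers modeled as Markov chains over six data points: uniform i.i.d. sampling and a simple random walk on a cycle. The function name `expected_hitting_times`, the two example chains, and the use of uniform weights over the starting location in the averaged summary are all assumptions made for illustration; the precise definitions of target time and hitting time used in the analysis may differ.

```python
import numpy as np

def expected_hitting_times(P):
    """Return H with H[i, j] = expected steps for the chain started at i
    to first reach j (H[j, j] = 0). Assumes P is an irreducible
    row-stochastic transition matrix. Hypothetical helper for illustration."""
    n = P.shape[0]
    H = np.zeros((n, n))
    for j in range(n):
        keep = [i for i in range(n) if i != j]
        # Restrict the chain to states other than the target j.
        Q = P[np.ix_(keep, keep)]
        # Hitting times solve (I - Q) h = 1 on the non-target states.
        h = np.linalg.solve(np.eye(n - 1) - Q, np.ones(n - 1))
        H[keep, j] = h
    return H

n = 6

# Sampler 1: uniform i.i.d. sampling, viewed as a Markov chain.
P_iid = np.full((n, n), 1.0 / n)

# Sampler 2: simple random walk on a cycle of n data points.
P_cycle = np.zeros((n, n))
for i in range(n):
    P_cycle[i, (i + 1) % n] = 0.5
    P_cycle[i, (i - 1) % n] = 0.5

H_iid = expected_hitting_times(P_iid)
H_cycle = expected_hitting_times(P_cycle)

# Supremum over the current location and the target data point.
worst_iid = H_iid.max()
worst_cycle = H_cycle.max()

# Average over the current location (uniform weights: an assumption of
# this sketch), then worst case over the target data point.
avg_iid = H_iid.mean(axis=0).max()
avg_cycle = H_cycle.mean(axis=0).max()

print("i.i.d.: worst-case", worst_iid, "averaged", avg_iid)
print("cycle : worst-case", worst_cycle, "averaged", avg_cycle)
```

In this toy setting the cycle walk takes longer in expectation to reach its farthest data point than uniform i.i.d. sampling does, which illustrates how the choice of sampler can change the recurrence-dependent constant in the bound.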