Gradient descent (GD) with a fixed or decaying step size is standard practice in unconstrained optimization. However, when the loss function is only locally convex, such a step-size schedule artificially slows GD down, because it cannot explore the flat curvature of the loss function. To overcome this issue, we propose exponentially increasing the step size of the GD algorithm. Under homogeneity assumptions on the loss function, we demonstrate that the iterates of the proposed \emph{exponential step size gradient descent} (EGD) algorithm converge linearly to the optimal solution. Leveraging this optimization insight, we then consider using the EGD algorithm for parameter estimation under both regular and non-regular statistical models whose loss function becomes locally convex as the sample size goes to infinity. We demonstrate that the EGD iterates reach the final statistical radius around the true parameter after a logarithmic number of iterations, in stark contrast to the \emph{polynomial} number of iterations required by the GD algorithm in non-regular statistical models. Therefore, the total computational complexity of the EGD algorithm is \emph{optimal} and exponentially cheaper than that of GD for parameter estimation in non-regular statistical models, while being comparable to that of GD in regular statistical settings. To the best of our knowledge, this resolves a long-standing gap between the statistical and algorithmic computational complexities of parameter estimation in non-regular statistical models. Finally, we provide targeted applications of the general theory to several classes of statistical models, including generalized linear models with polynomial link functions and location Gaussian mixture models.
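As a rough illustration of the step-size idea, the following minimal sketch compares a fixed step size with an exponentially increasing one on a toy homogeneous loss. Everything in it is an assumption made for illustration and is not taken from the abstract: the loss $f(x) = x^4$ (flat curvature at its minimizer $x = 0$), the schedule $\eta_t = \eta_0 \beta^t$, the hand-picked values of `eta0` and `beta`, and the helper names `gradient_descent` and `grad_quartic`.

```python
def gradient_descent(grad, x0, step_size, num_iters):
    """Run gradient descent; step_size(t) gives the step size at iteration t."""
    x = x0
    for t in range(num_iters):
        x = x - step_size(t) * grad(x)
    return x


def grad_quartic(x):
    """Gradient of the toy loss f(x) = x**4, whose curvature vanishes at x = 0."""
    return 4.0 * x ** 3


# Illustrative, hand-picked values (assumptions, not values from the paper).
eta0 = 0.01      # initial step size
beta = 1.1       # growth factor of the exponential schedule
num_iters = 200

# Baseline: the same step size at every iteration.
x_gd = gradient_descent(grad_quartic, 1.0, lambda t: eta0, num_iters)

# Exponentially increasing schedule: eta_t = eta0 * beta**t.
x_egd = gradient_descent(grad_quartic, 1.0, lambda t: eta0 * beta ** t, num_iters)

print(f"fixed step size:       |x - x*| = {abs(x_gd):.3e}")
print(f"exponential step size: |x - x*| = {abs(x_egd):.3e}")
```

On this toy problem the fixed-step iterate approaches the minimizer only at a polynomial rate, while the growing step size compensates for the vanishing gradient and gives a geometric decrease. The parameter values matter: an overly aggressive growth factor can make the iterates unstable, so this should be read as a sketch of the mechanism rather than a tuned implementation.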