We focus on the classification problem with a separable dataset, one of the most important and classical problems in machine learning. The standard approach to this task is logistic regression with gradient descent (LR+GD). Recent studies have observed that LR+GD can find a solution with arbitrarily large step sizes, defying conventional optimization theory. Our work investigates this phenomenon and makes three interconnected key observations about LR+GD with large step sizes. First, we find a remarkably simple explanation of why LR+GD with large step sizes solves the classification problem: LR+GD reduces to a batch version of the celebrated perceptron algorithm when the step size $\gamma \to \infty$. Second, we observe that larger step sizes lead LR+GD to higher logistic losses as it approaches the perceptron algorithm, yet larger step sizes also lead to faster convergence to a solution of the classification problem, meaning that the logistic loss is an unreliable metric of proximity to a solution. Surprisingly, high loss values can actually indicate faster convergence. Third, since the convergence rate of LR+GD in terms of loss function values is unreliable, we examine the iteration complexity required by LR+GD with large step sizes to solve the classification problem and prove that this complexity is suboptimal. To address this, we propose a new method, Normalized LR+GD, based on the connection between LR+GD and the perceptron algorithm, with much better theoretical guarantees.
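The reduction behind the first observation can be checked numerically: when the weight vector has large norm (as happens to LR+GD iterates under huge steps), the sigmoid factor in the logistic-loss gradient is near 1 on misclassified points and near 0 on correctly classified ones, so the negative gradient approaches an averaged batch perceptron update. Below is a minimal sketch of this check; the toy dataset, the function names `logistic_grad` and `perceptron_dir`, and the margin-clipping constant are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def logistic_grad(w, X, y):
    """Gradient of the logistic loss (1/n) * sum_i log(1 + exp(-y_i <x_i, w>))."""
    m = np.clip(y * (X @ w), -30.0, 30.0)   # margins, clipped for numerical stability
    s = 1.0 / (1.0 + np.exp(m))             # sigma(-margin)
    return -(s[:, None] * y[:, None] * X).mean(axis=0)

def perceptron_dir(w, X, y):
    """Averaged batch perceptron update: mean of y_i * x_i over misclassified points."""
    mis = y * (X @ w) < 0
    return (y[mis, None] * X[mis]).sum(axis=0) / len(y)

# Toy separable dataset and a weight vector that misclassifies two of the points.
X = np.array([[2., 1.], [1., 2.], [-1., -2.], [-2., -1.]])
y = np.array([1., 1., -1., -1.])
w = np.array([1., -1.])

# Scaling w up mimics the large-norm iterates produced by large step sizes:
# the gap between -logistic_grad and the perceptron direction shrinks to zero.
for c in (1.0, 10.0, 1000.0):
    gap = np.linalg.norm(logistic_grad(c * w, X, y) + perceptron_dir(c * w, X, y))
    print(f"scale {c:>7.1f}: gap {gap:.2e}")
```

The set of misclassified points is the same at every scale here; only the sigmoid weights change, which isolates the limiting behavior of the gradient itself.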