In this work we study the convergence properties of Dual Space Preconditioned Gradient Descent, a framework encompassing optimizers such as Normalized Gradient Descent, Gradient Clipping, and Adam. We consider preconditioners of the form $\nabla K$, where $K: \mathbb{R}^p \to \mathbb{R}$ is convex, and assume that the resulting method is used to train an over-parameterized linear model with loss of the form $\ell({X} {W} - {Y})$, for weights ${W} \in \mathbb{R}^{d \times k}$, labels ${Y} \in \mathbb{R}^{n \times k}$, and data ${X} \in \mathbb{R}^{n \times d}$. Under these assumptions, we prove that the iterates of preconditioned gradient descent always converge to a point ${W}_{\infty} \in \mathbb{R}^{d \times k}$ satisfying ${X}{W}_{\infty} = {Y}$. Our proof techniques are of independent interest: we introduce a novel variant of the Bregman divergence, together with accompanying identities, that allows us to establish convergence. We also study the implicit bias of Dual Space Preconditioned Gradient Descent. First, we demonstrate empirically that, for general $K(\cdot)$, ${W}_\infty$ depends on the chosen learning rate, hindering a precise characterization of the implicit bias. Then, for preconditioners of the form $K({G}) = h(\|{G}\|_F)$, known as \textit{isotropic preconditioners}, we show that ${W}_\infty$ is the minimizer of $\|{W} - {W}_0\|_F^2$ over all ${W}$ satisfying ${X}{W} = {Y}$, where ${W}_0$ is the initialization. Denoting by ${W}_{\text{GD}, \infty}$ the limit of gradient descent initialized at ${W}_0$, it follows that ${W}_{\infty} = {W}_{\text{GD}, \infty}$ for isotropic preconditioners. Finally, we show that a similar fact holds for general preconditioners up to a multiplicative constant, namely, $\|{W}_0 - {W}_{\infty}\|_F \le c \|{W}_0 - {W}_{\text{GD}, \infty}\|_F$ for some constant $c>0$.
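For concreteness, the iteration studied can be sketched as follows (the step size $\eta$ and the composite-loss notation below are introduced here for illustration and are not verbatim from the paper): writing $L({W}) = \ell({X}{W} - {Y})$, the update reads
\[
{W}_{t+1} = {W}_t - \eta \, \nabla K\big(\nabla L({W}_t)\big), \qquad t = 0, 1, 2, \dots
\]
For instance, the isotropic choice $K({G}) = \|{G}\|_F$ gives $\nabla K({G}) = {G}/\|{G}\|_F$ and recovers Normalized Gradient Descent, while $h(t) = t^2/2$ recovers plain gradient descent.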
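As a complementary illustration of the implicit-bias claim for isotropic preconditioners, the following minimal numerical sketch (not from the paper; the squared-error loss, the smooth isotropic preconditioner $K({G}) = \sqrt{1 + \|{G}\|_F^2}$, and all problem sizes and step sizes are illustrative choices) checks that the iterates approach the interpolator closest to the initialization ${W}_0$:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, k = 10, 30, 2                      # over-parameterized regime: d > n
X = rng.standard_normal((n, d))          # data
Y = rng.standard_normal((n, k))          # labels
W0 = rng.standard_normal((d, k))         # initialization

def grad_loss(W):
    # Gradient of the squared-error instance ell(XW - Y) = 0.5 * ||XW - Y||_F^2.
    return X.T @ (X @ W - Y)

def precond(G):
    # grad K for the isotropic choice K(G) = sqrt(1 + ||G||_F^2):
    # a smooth, soft-clipped version of the gradient direction.
    return G / np.sqrt(1.0 + np.linalg.norm(G) ** 2)

W, eta = W0.copy(), 1e-2                 # illustrative step size
for _ in range(50_000):
    W = W - eta * precond(grad_loss(W))

# Interpolator closest to W0 in Frobenius norm: W0 + X^+ (Y - X W0).
W_star = W0 + np.linalg.pinv(X) @ (Y - X @ W0)

print("residual      ||X W - Y||_F:", np.linalg.norm(X @ W - Y))
print("gap to closest interpolator:", np.linalg.norm(W - W_star))
```

Because every update direction lies in the row space of ${X}$, the iterates remain in ${W}_0 + \mathrm{rowspace}({X})$, so the only interpolator they can converge to is the one closest to ${W}_0$; both printed quantities should therefore be near zero.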