Large language models (LLMs) have brought significant changes to human society. Softmax regression and residual neural networks (ResNet) are two important techniques in deep learning: both are theoretical components underlying LLMs, and both are connected to many other areas of machine learning and theoretical computer science, including but not limited to image classification, object detection, semantic segmentation, and tensor-related problems. Previous work has studied these two concepts separately. In this paper, we provide a theoretical analysis of the regression problem $\| \langle \exp(Ax) + A x , {\bf 1}_n \rangle^{-1} ( \exp(Ax) + Ax ) - b \|_2^2$, where $A$ is a matrix in $\mathbb{R}^{n \times d}$, $x$ is the variable in $\mathbb{R}^d$, $b$ is a vector in $\mathbb{R}^n$, and ${\bf 1}_n$ is the $n$-dimensional vector whose entries are all $1$. This regression problem is a unified scheme that combines softmax regression and ResNet, which, to our knowledge, has not been done before. We derive the gradient, Hessian, and Lipschitz properties of the loss function. We show that the Hessian is positive semidefinite and characterize its structure as the sum of a low-rank matrix and a diagonal matrix, which enables an efficient approximate Newton method. This unified scheme thus connects two fields previously thought to be unrelated and provides new insight into the loss landscape and optimization of emerging over-parameterized neural networks, which is relevant to future research on deep learning models.
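To make the objective concrete, the following is a minimal NumPy sketch of the loss stated above. The function names `unified_loss` and `unified_grad` are hypothetical, and the gradient is obtained here by a direct chain-rule calculation rather than taken from the paper's derivation. The sketch assumes the normalizer $\langle \exp(Ax) + Ax, {\bf 1}_n \rangle$ is nonzero.

```python
import numpy as np


def unified_loss(A, x, b):
    """Evaluate || <u, 1_n>^{-1} u - b ||_2^2 with u = exp(Ax) + Ax."""
    z = A @ x
    u = np.exp(z) + z          # softmax-style term plus residual-style term
    alpha = u.sum()            # <u, 1_n>; assumed nonzero
    r = u / alpha - b          # residual of the normalized vector against b
    return float(r @ r)


def unified_grad(A, x, b):
    """Gradient of unified_loss with respect to x (chain-rule sketch)."""
    z = A @ x
    u = np.exp(z) + z
    alpha = u.sum()
    f = u / alpha
    r = f - b
    w = np.exp(z) + 1.0        # elementwise derivative du_j / dz_j
    # With J_ij = (delta_ij - f_i) * w_j / alpha, the loss gradient in z is 2 * J^T r.
    g_z = 2.0 * w * (r - f @ r) / alpha
    return A.T @ g_z


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    A = 0.1 * rng.standard_normal((5, 3))   # small entries keep the normalizer positive
    x = rng.standard_normal(3)
    b = rng.random(5)
    b /= b.sum()
    print("loss:", unified_loss(A, x, b))
    print("grad:", unified_grad(A, x, b))
```

A finite-difference comparison is a simple way to sanity-check such a gradient before using it inside a Newton-type iteration.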