Large language models (LLMs), such as ChatGPT and GPT4, have shown outstanding performance on many everyday human tasks. Attention computation plays an important role in training LLMs, and the softmax unit and the ReLU unit are key structures in attention computation. Inspired by them, we propose a softmax ReLU regression problem. Broadly speaking, our goal is to find an optimal solution to a regression problem involving the ReLU unit. In this work, we derive a closed-form representation for the Hessian of the loss function. Under certain assumptions, we prove that the Hessian is Lipschitz continuous and positive semidefinite (PSD). We then introduce a greedy algorithm based on an approximate Newton method, which converges in the sense of the distance to the optimal solution. Finally, we relax the Lipschitz condition and prove convergence in the sense of the loss value.