Large language models (LLMs) have brought transformative changes to human society. One of the key computations in LLMs is the softmax unit. This operation is important in LLMs because it allows the model to generate a distribution over possible next words or phrases, given a sequence of input words. This distribution is then used to select the most likely next word or phrase, based on the probabilities assigned by the model. The softmax unit plays a crucial role in training LLMs, as it allows the model to learn from the data by adjusting the weights and biases of the neural network. In convex optimization, for example when using the central path method to solve linear programs, the softmax function has been used as a crucial tool for controlling the progress and stability of the potential function [Cohen, Lee and Song STOC 2019, Brand SODA 2020]. In this work, inspired by the softmax unit, we define a softmax regression problem. Formally speaking, given a matrix $A \in \mathbb{R}^{n \times d}$ and a vector $b \in \mathbb{R}^n$, the goal is to use a greedy-type algorithm to solve \begin{align*} \min_{x} \| \langle \exp(Ax), {\bf 1}_n \rangle^{-1} \exp(Ax) - b \|_2^2. \end{align*} In a certain sense, our provable convergence result provides theoretical support for why greedy algorithms can be used to train the softmax function in practice.
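To make the objective above concrete, the following is a minimal NumPy sketch of the softmax regression loss, its gradient, and a simple descent loop. The abstract does not specify which greedy-type algorithm is analyzed, so plain gradient descent with backtracking is used here purely as an illustrative stand-in. The function names (`softmax`, `loss`, `grad`, `descend`) and the step-size parameters are assumptions introduced for this example, not taken from the paper.

```python
import numpy as np


def softmax(z):
    """Return <exp(z), 1_n>^{-1} exp(z), shifted by max(z) for numerical stability."""
    e = np.exp(z - np.max(z))
    return e / e.sum()


def loss(A, b, x):
    """Softmax regression objective: || softmax(A x) - b ||_2^2."""
    r = softmax(A @ x) - b
    return float(r @ r)


def grad(A, b, x):
    """Gradient of the objective with respect to x.

    Uses the softmax Jacobian diag(f) - f f^T, where f = softmax(A x).
    """
    f = softmax(A @ x)
    r = f - b
    return 2.0 * (A.T @ (f * r - f * (f @ r)))


def descend(A, b, x0, steps=200, step_size=1.0):
    """Illustrative stand-in for a greedy-type method (an assumption):
    gradient descent with Armijo backtracking, so the loss never increases.
    """
    x = x0.copy()
    for _ in range(steps):
        g = grad(A, b, x)
        cur = loss(A, b, x)
        t = step_size
        # Halve the step until the sufficient-decrease condition holds.
        while loss(A, b, x - t * g) > cur - 0.5 * t * (g @ g) and t > 1e-12:
            t *= 0.5
        if t <= 1e-12:
            break  # no acceptable step found; stop without moving
        x = x - t * g
    return x
```

A typical use would draw a random $A$, set $b$ to the softmax of $A x^\*$ for some hidden $x^\*$ so that the optimal value is zero, and call `descend(A, b, np.zeros(d))`. The objective is not convex in general, so this sketch only guarantees a non-increasing loss, not convergence to a global minimizer.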