Large language models (LLMs) have brought about significant transformations in human society. Among the crucial computations in LLMs, the softmax unit is of great importance. It helps the model generate a probability distribution over potential subsequent words or phrases, given a sequence of input words. Using this distribution, the model selects the most probable next word or phrase according to the assigned probabilities. The softmax unit also plays a vital role in LLM training, as it facilitates learning from data through the adjustment of neural network weights and biases. As LLMs grow in size, computing the gradient becomes expensive. However, zeroth-order methods can approximately compute the gradient using only forward passes. In this paper, we present a zeroth-order algorithm specifically tailored for softmax optimization. We demonstrate the convergence of our algorithm, highlighting its effectiveness in efficiently computing gradients for large-scale LLMs. By leveraging the zeroth-order method, our work contributes to the advancement of optimization techniques in the context of complex language models.
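To make the idea of approximating a gradient from forward passes concrete, the following is a minimal sketch of a generic two-point zeroth-order estimator. It is not the algorithm of this paper: the softmax regression loss `0.5 * ||softmax(Ax) - b||^2`, the Gaussian perturbation scheme, and all names (`loss`, `zo_gradient`, `mu`, `num_samples`, the toy values of `A`, `b`, `x0`) are assumptions chosen for illustration.

```python
import math
import random


def softmax(z):
    # Numerically stable softmax of a list of floats.
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


def loss(A, b, x):
    # Assumed toy objective: 0.5 * || softmax(A x) - b ||^2.
    z = [sum(a * xj for a, xj in zip(row, x)) for row in A]
    p = softmax(z)
    return 0.5 * sum((pi - bi) ** 2 for pi, bi in zip(p, b))


def zo_gradient(f, x, mu=1e-4, num_samples=2000, rng=None):
    # Two-point zeroth-order estimate of grad f(x): average of
    # (f(x + mu*u) - f(x - mu*u)) / (2*mu) * u over Gaussian directions u.
    # Only evaluations of f (forward passes) are used, no backpropagation.
    rng = rng or random.Random(0)
    d = len(x)
    grad = [0.0] * d
    for _ in range(num_samples):
        u = [rng.gauss(0.0, 1.0) for _ in range(d)]
        x_plus = [xi + mu * ui for xi, ui in zip(x, u)]
        x_minus = [xi - mu * ui for xi, ui in zip(x, u)]
        scale = (f(x_plus) - f(x_minus)) / (2.0 * mu)
        for j in range(d):
            grad[j] += scale * u[j] / num_samples
    return grad


# Hypothetical toy problem: 3 output classes, 2 parameters.
A = [[1.0, 2.0], [0.5, -1.0], [-1.5, 0.3]]
b = [0.2, 0.5, 0.3]
x0 = [0.1, -0.2]

f = lambda x: loss(A, b, x)
estimate = zo_gradient(f, x0)
print("zeroth-order gradient estimate:", estimate)
```

Each sample costs two evaluations of the loss, so the estimate trades extra forward passes for the memory and compute of a backward pass. The estimate is noisy, and its accuracy improves as `num_samples` grows.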