Large language models (LLMs) have numerous real-world applications across domains such as natural language translation, sentiment analysis, language modeling, chatbots and conversational agents, creative writing, text classification, summarization, and generation. LLMs have shown great promise in improving the accuracy and efficiency of these tasks and have the potential to revolutionize natural language processing (NLP) in the years to come. The exponential-function-based attention unit is a fundamental building block of LLMs, and several previous works have studied the convergence of exponential regression and softmax regression. In this paper, we propose an iterative algorithm for a rescaled variant of the softmax regression problem that arises in the attention mechanisms of large language models. Specifically, we minimize the squared loss between a chosen function of the input, which can be the exponential, hyperbolic sine, or hyperbolic cosine function, and a target $n$-dimensional vector $b$ scaled by the normalization term. This ``rescaled softmax regression'' differs from classical softmax regression only in the location of the normalization factor. The efficiency of the algorithm and its applicability to multiple hyperbolic functions make the framework relevant for optimizing attention mechanisms. The analysis also yields a corollary bounding the change in the solution under small perturbations, which is relevant to in-context learning. Limitations and societal impact are discussed.
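To make the objective concrete, one natural reading of this description, in notation introduced here purely for illustration (the matrix $A$, the entrywise map $u(\cdot)$, and the all-ones vector $\mathbf{1}_n$ are our own symbols and may differ from the paper's exact formulation), is the following pair of problems, where classical softmax regression normalizes the prediction while the rescaled version moves the normalization term onto $b$:
\[
\min_{x \in \mathbb{R}^d} \big\| \langle u(Ax), \mathbf{1}_n \rangle^{-1}\, u(Ax) - b \big\|_2^2 \quad \text{(classical)},
\qquad
\min_{x \in \mathbb{R}^d} \big\| u(Ax) - \langle u(Ax), \mathbf{1}_n \rangle\, b \big\|_2^2 \quad \text{(rescaled)},
\]
with $A \in \mathbb{R}^{n \times d}$, $b \in \mathbb{R}^n$, and $u(\cdot) \in \{\exp(\cdot), \sinh(\cdot), \cosh(\cdot)\}$ applied entrywise to $Ax$.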