The softmax activation function plays a crucial role in the success of large language models (LLMs), particularly in the self-attention mechanism of the widely adopted Transformer architecture. However, the underlying learning dynamics that contribute to the effectiveness of softmax remain largely unexplored. As a step towards a better understanding, this paper presents a theoretical study of the optimization and generalization properties of two-layer softmax neural networks, offering insights into their advantages over other activation functions such as ReLU and exponential. Leveraging the Neural Tangent Kernel (NTK) framework, our analysis shows that the normalization effect of the softmax function yields a favorable perturbation property of the induced NTK matrix, which in turn produces a well-behaved convex region of the loss landscape. Consequently, softmax neural networks can learn the target function in the over-parametrization regime. To demonstrate the broad applicability of our theoretical findings, we apply them to the task of learning score estimation functions in diffusion models, a promising approach to generative modeling. Our analysis shows that gradient-based algorithms can learn the score function with provable accuracy. Our work deepens the understanding of the effectiveness of softmax neural networks and their potential across various domains, paving the way for further advances in natural language processing and beyond.
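For concreteness, the following minimal sketch illustrates one plausible parameterization of the kind of two-layer softmax network the abstract refers to; the exact architecture, scaling, and initialization used in the paper may differ, and the function names here are purely illustrative. The sketch also shows the normalization effect of softmax (its outputs are positive and sum to one), which is the property the NTK perturbation argument relies on.

```python
# Illustrative sketch (not the paper's exact parameterization): a two-layer
# network whose hidden pre-activations are passed through a softmax before
# the fixed outer-layer combination.
import numpy as np

def softmax(z):
    z = z - z.max()              # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()           # entries are positive and sum to 1

def two_layer_softmax(x, W, a):
    """f(x; W, a) = <a, softmax(W^T x)> with trainable W and fixed outer weights a."""
    return a @ softmax(W.T @ x)

rng = np.random.default_rng(0)
d, m = 8, 32                     # input dimension, hidden width
W = rng.standard_normal((d, m))  # trainable hidden-layer weights
a = rng.choice([-1.0, 1.0], m)   # fixed +/-1 outer weights (a common NTK-style setup)
x = rng.standard_normal(d)

print(two_layer_softmax(x, W, a))
print(softmax(W.T @ x).sum())    # ~1.0: the normalization effect of softmax
```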