Over the past few years, a significant body of research has studied the ReLU activation function, with the aim of proving neural network convergence through over-parameterization. However, recent developments in Large Language Models (LLMs) have sparked interest in exponential activation functions, specifically in the attention mechanism. Mathematically, we define the neural function $F: \mathbb{R}^{d \times m} \times \mathbb{R}^d \rightarrow \mathbb{R}$ using an exponential activation function. We are given a set of labeled data points $\{(x_1, y_1), (x_2, y_2), \dots, (x_n, y_n)\} \subset \mathbb{R}^d \times \mathbb{R}$, where $n$ denotes the number of data points. The function is $F(W(t),x) := \sum_{r=1}^m a_r \exp(\langle w_r(t), x \rangle)$, where $m$ is the number of neurons and $w_r(t)$ is the weight vector of the $r$-th neuron at time $t$. As is standard in the literature, the output weights $a_r$ are fixed and are never updated during training. We initialize the weights $W(0) \in \mathbb{R}^{d \times m}$ from a Gaussian distribution, $w_r(0) \sim \mathcal{N}(0, I_d)$, and draw each $a_r$ from the random sign distribution, for each $r \in [m]$. We show that gradient descent finds a weight matrix $W(T)$ such that $\| F(W(T), X) - y \|_2 \leq \epsilon$ holds with probability $1-\delta$, where $\epsilon \in (0,0.1)$, provided that $m = \Omega(n^{2+o(1)}\log(n/\delta))$. To tighten the over-parameterization bound on $m$, we employ several tight analysis techniques from previous studies [Song and Yang arXiv 2019, Munteanu, Omlor, Song and Woodruff ICML 2022].
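To make the setup above concrete, the following is a minimal sketch of the model $F(W,x) = \sum_{r=1}^m a_r \exp(\langle w_r, x \rangle)$, its initialization, and plain gradient descent on $W$ with $a_r$ held fixed. Several pieces are assumptions that the text does not state: the loss $\frac{1}{2}\| F(W, X) - y \|_2^2$, the reading of "random sign" as uniform over $\{-1,+1\}$, the unit-norm toy inputs with $\pm 1$ labels, and the step size and step count. The helper names (`forward`, `gradient`, `train`, `make_toy_data`, and so on) are hypothetical. The sizes are far too small to exercise the $m = \Omega(n^{2+o(1)}\log(n/\delta))$ regime, so this only illustrates the training procedure and does not reproduce the stated guarantee.

```python
import math
import random


def forward(W, a, x):
    """F(W, x) = sum_r a_r * exp(<w_r, x>); W is a list of m weight vectors."""
    return sum(a_r * math.exp(sum(wi * xi for wi, xi in zip(w_r, x)))
               for w_r, a_r in zip(W, a))


def residuals(W, a, X, y):
    """Vector F(W, X) - y over all n data points."""
    return [forward(W, a, x_i) - y_i for x_i, y_i in zip(X, y)]


def residual_norm(W, a, X, y):
    """|| F(W, X) - y ||_2, the quantity bounded by epsilon in the text."""
    return math.sqrt(sum(r * r for r in residuals(W, a, X, y)))


def gradient(W, a, X, y):
    """Gradient of the assumed loss 0.5 * ||F(W, X) - y||_2^2 w.r.t. each w_r."""
    res = residuals(W, a, X, y)
    grad = [[0.0] * len(w_r) for w_r in W]
    for r, (w_r, a_r) in enumerate(zip(W, a)):
        for x_i, res_i in zip(X, res):
            # d/dw_r [a_r * exp(<w_r, x_i>)] = a_r * exp(<w_r, x_i>) * x_i
            coeff = res_i * a_r * math.exp(
                sum(wi * xi for wi, xi in zip(w_r, x_i)))
            for k, x_ik in enumerate(x_i):
                grad[r][k] += coeff * x_ik
    return grad


def init_params(d, m, rng):
    """w_r(0) ~ N(0, I_d); a_r is a random sign (assumed uniform), then fixed."""
    W = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(m)]
    a = [rng.choice((-1.0, 1.0)) for _ in range(m)]
    return W, a


def train(W, a, X, y, lr, steps):
    """Plain gradient descent on W only (in place); returns residual norms."""
    history = [residual_norm(W, a, X, y)]
    for _ in range(steps):
        grad = gradient(W, a, X, y)
        for w_r, g_r in zip(W, grad):
            for k in range(len(w_r)):
                w_r[k] -= lr * g_r[k]
        history.append(residual_norm(W, a, X, y))
    return history


def make_toy_data(n, d, rng):
    """Hypothetical toy data: unit-norm inputs and labels in {-1, +1}."""
    X = []
    for _ in range(n):
        v = [rng.gauss(0.0, 1.0) for _ in range(d)]
        norm = math.sqrt(sum(c * c for c in v))
        X.append([c / norm for c in v])
    y = [rng.choice((-1.0, 1.0)) for _ in range(n)]
    return X, y


if __name__ == "__main__":
    rng = random.Random(0)
    X, y = make_toy_data(n=4, d=3, rng=rng)
    W, a = init_params(d=3, m=64, rng=rng)
    # Step size and step count are illustrative choices, not from the text.
    history = train(W, a, X, y, lr=1e-5, steps=300)
    print(f"residual norm: {history[0]:.4f} -> {history[-1]:.4f}")
```

The step size is kept deliberately small because the exponential activation makes the loss curvature grow quickly with the weight norms; a larger step may diverge on this toy problem.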