MLPs at the EOC: Concentration of the NTK

We study the concentration of the Neural Tangent Kernel (NTK) $K_\theta : \mathbb{R}^{m_0} \times \mathbb{R}^{m_0} \to \mathbb{R}^{m_l \times m_l}$ of $l$-layer Multilayer Perceptrons (MLPs) $N : \mathbb{R}^{m_0} \times \Theta \to \mathbb{R}^{m_l}$ equipped with activation functions $\phi(s) = a s + b \vert s \vert$ for some $a,b \in \mathbb{R}$, with the parameter $\theta \in \Theta$ initialized at the Edge Of Chaos (EOC). Without relying on the gradient independence assumption, which has only been shown to hold asymptotically in the infinitely wide limit, we prove that an approximate version of gradient independence holds at finite width. By showing via maximal inequalities that the NTK entries $K_\theta(x_{i_1},x_{i_2})$ for $i_1,i_2 \in [1:n]$ over a dataset $\{x_1,\ldots,x_n\} \subset \mathbb{R}^{m_0}$ concentrate simultaneously, we prove that the NTK matrix $K(\theta) = [\frac{1}{n} K_\theta(x_{i_1},x_{i_2}) : i_1,i_2 \in [1:n]] \in \mathbb{R}^{nm_l \times nm_l}$ concentrates around its infinitely wide limit $\overset{\scriptscriptstyle\infty}{K} \in \mathbb{R}^{nm_l \times nm_l}$ without the need for linear overparameterization. Our results imply that, to approximate the limit accurately, hidden layer widths have to grow quadratically as $m_k = k^2 m$ for some $m \in \mathbb{N}+1$. For such MLPs, we obtain the concentration bound $\mathbb{P}( \Vert K(\theta) - \overset{\scriptscriptstyle\infty}{K} \Vert \leq O((\Delta_\phi^{-2} + m_l^{\frac{1}{2}} l) \kappa_\phi^2 m^{-\frac{1}{2}})) \geq 1-O(m^{-1})$ modulo logarithmic terms, where we denote $\Delta_\phi = \frac{b^2}{a^2+b^2}$ and $\kappa_\phi = \frac{\vert a \vert + \vert b \vert}{\sqrt{a^2 + b^2}}$. This reveals in particular that the absolute value ($\Delta_\phi=1$, $\kappa_\phi=1$) beats the ReLU ($\Delta_\phi=\frac{1}{2}$, $\kappa_\phi=\sqrt{2}$) in terms of the concentration of the NTK.
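
As an illustration of how the activation enters the bound, the following minimal Python sketch evaluates $\Delta_\phi$, $\kappa_\phi$ and the rate factor $(\Delta_\phi^{-2} + m_l^{\frac{1}{2}} l) \kappa_\phi^2 m^{-\frac{1}{2}}$ for a given pair $(a,b)$. The function names are hypothetical, the constant hidden in the $O(\cdot)$ and the logarithmic terms are dropped, and the indexing of hidden layers as $k = 1,\ldots,l-1$ is an assumption, so the numbers are only meaningful for comparing activations, not as actual error bounds.

```python
import math


def delta_phi(a: float, b: float) -> float:
    """Delta_phi = b^2 / (a^2 + b^2) for phi(s) = a*s + b*|s|."""
    return b**2 / (a**2 + b**2)


def kappa_phi(a: float, b: float) -> float:
    """kappa_phi = (|a| + |b|) / sqrt(a^2 + b^2)."""
    return (abs(a) + abs(b)) / math.sqrt(a**2 + b**2)


def rate_factor(a: float, b: float, m_l: int, l: int, m: int) -> float:
    """(Delta^-2 + sqrt(m_l) * l) * kappa^2 / sqrt(m).

    Illustrative only: the O(.) constant and log terms are omitted.
    """
    d = delta_phi(a, b)
    k = kappa_phi(a, b)
    return (d**-2 + math.sqrt(m_l) * l) * k**2 / math.sqrt(m)


def hidden_widths(m: int, l: int) -> list[int]:
    """Quadratic width schedule m_k = k^2 * m.

    Assumption: hidden layers are indexed k = 1, ..., l-1.
    """
    return [k * k * m for k in range(1, l)]


if __name__ == "__main__":
    # ReLU(s) = (s + |s|) / 2, i.e. a = b = 1/2; absolute value: a = 0, b = 1.
    activations = {"ReLU": (0.5, 0.5), "abs": (0.0, 1.0)}
    for name, (a, b) in activations.items():
        print(name, delta_phi(a, b), kappa_phi(a, b),
              rate_factor(a, b, m_l=1, l=3, m=100))
    print(hidden_widths(m=100, l=3))
```

With the arbitrary choices $m_l = 1$, $l = 3$, $m = 100$, the factor is roughly $1.4$ for the ReLU and $0.4$ for the absolute value, matching the ordering stated in the abstract.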
