While the empirical success of self-supervised learning (SSL) relies heavily on deep nonlinear models, existing theoretical work on understanding SSL still focuses on linear ones. In this paper, we study the role of nonlinearity in the training dynamics of contrastive learning (CL) on one- and two-layer nonlinear networks with homogeneous activation $h(x) = h'(x)x$. We make two major theoretical discoveries. First, the presence of nonlinearity can lead to many local optima even in the 1-layer setting, each corresponding to certain patterns in the data distribution, whereas with linear activation only one major pattern can be learned. This suggests that models with many parameters can be regarded as a \emph{brute-force} way to find these local optima induced by nonlinearity. Second, in the 2-layer setting, linear activation is proven incapable of learning weights specialized to diverse patterns, demonstrating the importance of nonlinearity. In addition, for the 2-layer setting, we discover \emph{global modulation}: local patterns that are discriminative from the perspective of global-level patterns are prioritized during learning, which further characterizes the learning process. Simulations verify our theoretical findings.