This paper demonstrates that grokking behavior in a neural network trained on modular arithmetic with modulus P can be controlled by modifying the profile of the activation function as well as the depth and width of the model. Plotting the even PCA projections of the last-layer weights against the odd projections yields patterns that become significantly more uniform as the nonlinearity is increased by adding layers. When P is nonprime, these patterns can be used to factor P. Finally, a metric for the generalization ability of the network is inferred from the entropy of the layer weights, while the degree of nonlinearity is related to correlations between the local entropies of the weights of the neurons in the final layer.
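As a rough illustration of the projection analysis described above, the sketch below plots even-indexed principal-component projections of a final-layer weight matrix against odd-indexed ones using scikit-learn's PCA. The function name, the pairing of component 2k with component 2k+1, and the assumption that the weights arrive as a `(n_neurons, n_features)` array are all hypothetical conveniences, not the paper's actual pipeline.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

def plot_even_odd_pca(last_layer_weights, n_components=8):
    """Scatter even-indexed PCA projections of final-layer weights
    against odd-indexed ones (a sketch; pairing convention assumed).

    last_layer_weights: array of shape (n_neurons, n_features),
    e.g. the weight matrix of the final linear layer.
    """
    pca = PCA(n_components=n_components)
    # Each row of proj is one neuron's coordinates in PCA space.
    proj = pca.fit_transform(last_layer_weights)

    # Pair component 0 with 1, 2 with 3, ... (even vs. odd projections).
    for k in range(0, n_components - 1, 2):
        plt.figure()
        plt.scatter(proj[:, k], proj[:, k + 1], s=10)
        plt.xlabel(f"PC {k} (even)")
        plt.ylabel(f"PC {k + 1} (odd)")
        plt.title(f"Even vs. odd PCA projections of last-layer weights")
    plt.show()
```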
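The abstract also ties generalization to the entropy of the layer weights. One plausible reading, sketched below, is the Shannon entropy of the empirical distribution of a layer's weight values; the binning scheme and function name are assumptions for illustration, not the paper's definition.

```python
import numpy as np

def weight_entropy(weights, n_bins=50):
    """Shannon entropy of the empirical weight distribution of a layer.

    Histogram the weight values, normalize to a probability
    distribution, and compute -sum(p * log p). Higher entropy
    indicates a more spread-out weight distribution.
    """
    hist, _ = np.histogram(np.ravel(weights), bins=n_bins)
    p = hist / hist.sum()
    p = p[p > 0]  # drop empty bins to avoid log(0)
    return float(-np.sum(p * np.log(p)))
```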