Recent results on the optimization and generalization properties of neural networks show that, in a simple two-layer network, the alignment of the labels with the eigenvectors of the corresponding Gram matrix determines the convergence of optimization during training. Such analyses also provide upper bounds on the generalization error. We experimentally investigate the implications of these results for deeper networks via embeddings. We regard the layers preceding the final hidden layer as producing different representations of the input data, which are then fed to the two-layer model. We show that these representations improve both optimization and generalization. In particular, we investigate three kernel representations as inputs to the final hidden layer: the Gaussian kernel and its approximation by random Fourier features; kernels designed to imitate representations produced by neural networks; and an optimal kernel designed to align the data with the target labels. The approximate representations induced by these kernels are fed to the neural network, and the optimization and generalization properties of the final model are evaluated and compared.
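As a rough illustration of two quantities mentioned above, the following sketch computes how the energy of a label vector is distributed over the eigenvectors of a Gram matrix, and builds a random Fourier feature approximation of the Gaussian kernel. It is a minimal sketch under stated assumptions, not the authors' implementation: the function names, the toy data, the bandwidth parameter `gamma`, and the use of a Gaussian-kernel Gram matrix in place of the network-specific Gram matrix of the cited analyses are all illustrative choices.

```python
import numpy as np


def gaussian_gram(X, gamma):
    """Gram matrix K[i, j] = exp(-gamma * ||x_i - x_j||^2) (hypothetical helper)."""
    sq = np.sum(X ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * (X @ X.T)
    # Clip tiny negative distances caused by floating-point round-off.
    return np.exp(-gamma * np.maximum(d2, 0.0))


def rff_features(X, gamma, n_features, rng):
    """Random Fourier features Z such that Z @ Z.T approximates gaussian_gram(X, gamma).

    Frequencies are drawn from N(0, 2 * gamma * I), the spectral density of the
    kernel exp(-gamma * ||x - y||^2); phases are uniform on [0, 2 * pi).
    """
    d = X.shape[1]
    W = rng.normal(scale=np.sqrt(2.0 * gamma), size=(d, n_features))
    b = rng.uniform(0.0, 2.0 * np.pi, size=n_features)
    return np.sqrt(2.0 / n_features) * np.cos(X @ W + b)


def label_alignment(K, y):
    """Eigenvalues of K in descending order and the squared projections of y
    onto the matching eigenvectors. The projections sum to ||y||^2."""
    eigvals, eigvecs = np.linalg.eigh(K)  # eigh returns ascending eigenvalues
    order = np.argsort(eigvals)[::-1]
    proj = (eigvecs[:, order].T @ y) ** 2
    return eigvals[order], proj


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.normal(size=(50, 5))   # toy inputs (assumption)
    y = np.sign(X[:, 0])           # toy labels in {-1, +1} (assumption)

    K = gaussian_gram(X, gamma=0.1)
    eigvals, proj = label_alignment(K, y)
    # Fraction of label energy captured by the 5 leading eigenvectors.
    print("top-5 energy fraction:", proj[:5].sum() / proj.sum())

    Z = rff_features(X, gamma=0.1, n_features=2000, rng=rng)
    print("max |Z Z^T - K|:", np.max(np.abs(Z @ Z.T - K)))
```

In this toy setting, a larger share of label energy on the eigenvectors with large eigenvalues corresponds to the favorable case described by the two-layer analyses. The matrix `Z` plays the role of an approximate kernel representation that could be passed on as input to a downstream model.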