This article derives and validates three principles for initialization and architecture selection in finite-width graph neural networks (GNNs) with ReLU activations. First, we theoretically derive what is essentially the unique generalization of the well-known He initialization to ReLU GNNs. Our initialization scheme guarantees that the average scale of network outputs and gradients remains of order one at initialization. Second, we prove that in finite-width vanilla ReLU GNNs, oversmoothing is unavoidable at large depth when a fixed aggregation operator is used, regardless of initialization. We then prove that residual aggregation operators, obtained by interpolating a fixed aggregation operator with the identity, alleviate oversmoothing at initialization. Finally, we show that the common practice of using residual connections with a fixup-type initialization provably avoids correlation collapse in final-layer features at initialization. Through ablation studies, we find that using the correct initialization, residual aggregation operators, and residual connections in the forward pass significantly and reliably speeds up early training dynamics in deep ReLU GNNs on a variety of tasks.
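To make the idea of a residual aggregation operator concrete, the following is a minimal toy sketch, not the construction analyzed in the article. It assumes the interpolation is a convex combination `(1 - alpha) * A + alpha * I`, uses a row-normalized path graph with self-loops as the fixed operator `A`, and propagates a scalar node feature linearly, with no weights and no ReLU. The names `row_normalized_path_graph`, `residual_operator`, `propagate`, `spread`, and the value `alpha=0.5` are all hypothetical choices made for illustration.

```python
# Toy illustration only: linear propagation of one scalar feature per node,
# with no weight matrices and no ReLU. All names here are hypothetical.

def row_normalized_path_graph(n):
    """Aggregation operator D^{-1}(Adj + I) for a path graph on n nodes."""
    rows = []
    for i in range(n):
        # Each node averages over itself and its in-range neighbours.
        neighbors = [j for j in (i - 1, i, i + 1) if 0 <= j < n]
        rows.append([1.0 / len(neighbors) if j in neighbors else 0.0
                     for j in range(n)])
    return rows

def residual_operator(A, alpha):
    """Interpolate A with the identity: (1 - alpha) * A + alpha * I (assumed form)."""
    n = len(A)
    return [[(1.0 - alpha) * A[i][j] + (alpha if i == j else 0.0)
             for j in range(n)] for i in range(n)]

def propagate(A, x, depth):
    """Apply the aggregation operator to the node features `depth` times."""
    for _ in range(depth):
        x = [sum(a * v for a, v in zip(row, x)) for row in A]
    return x

def spread(x):
    """Crude oversmoothing proxy: gap between largest and smallest node feature."""
    return max(x) - min(x)

A = row_normalized_path_graph(4)
A_res = residual_operator(A, alpha=0.5)
x0 = [0.0, 1.0, 2.0, 3.0]

spread_fixed = spread(propagate(A, x0, depth=10))
spread_res = spread(propagate(A_res, x0, depth=10))
print("fixed operator spread:   ", spread_fixed)
print("residual operator spread:", spread_res)
```

In this linear toy both operators eventually shrink the differences between node features, but the interpolated operator does so more slowly. The sketch says nothing about random weights, ReLU nonlinearities, or finite-width effects, which are the setting of the article's results.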