We consider the training of the first layer of vision models and notice a clear relationship between pixel values and gradient update magnitudes: the gradients arriving at the weights of the first layer are, by definition, directly proportional to the (normalized) input pixel values. Thus, an image with low contrast has a smaller impact on learning than an image with higher contrast, and a very bright or very dark image has a stronger impact on the weights than an image with moderate brightness. In this work, we propose performing gradient descent on the embeddings produced by the first layer of the model. However, switching to discrete inputs with an embedding layer is not a reasonable option for vision models. Thus, we propose the conceptual procedure of (i) taking a gradient descent step on the first-layer activations to construct an activation proposal, and (ii) finding the optimal weights of the first layer, i.e., those that minimize the squared distance to the activation proposal. We provide a closed-form solution for this procedure and adapt it for robust, computationally efficient stochastic training. Empirically, we find that TrAct (Training Activations) speeds up training by factors between 1.25x and 4x while requiring only a small computational overhead. We demonstrate the utility of TrAct with different optimizers for a range of vision models, including convolutional and transformer architectures.
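The two-step procedure above can be illustrated with a minimal NumPy sketch. This is not the paper's implementation; all shapes and variable names are illustrative assumptions. For a linear first layer with activations Z = W X, step (i) forms the activation proposal by a gradient step on Z, and step (ii) solves the least-squares problem for the weights closest (in squared distance on activations) to that proposal, which here admits the closed form Z_prop X^T (X X^T)^{-1}.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy shapes; all names are illustrative assumptions, not the paper's code.
d_in, d_out, b = 8, 4, 16           # input dim, first-layer width, batch size
W = rng.normal(size=(d_out, d_in))  # first-layer weights
X = rng.normal(size=(d_in, b))      # mini-batch of flattened inputs

Z = W @ X                           # first-layer activations
G = rng.normal(size=Z.shape)        # stand-in for dL/dZ from backpropagation
eta = 0.1                           # step size

# (i) Gradient descent step on the activations gives the activation proposal.
Z_prop = Z - eta * G

# (ii) Weights minimizing ||W_new @ X - Z_prop||_F^2: least squares in W_new.
# With b >= d_in and generic inputs, the Gram matrix X X^T is invertible.
A = X @ X.T                                   # (d_in x d_in) input Gram matrix
W_new = np.linalg.solve(A, X @ Z_prop.T).T    # Z_prop X^T (X X^T)^{-1}

# Algebraically, the resulting update is the raw gradient G X^T preconditioned
# by (X X^T)^{-1}, which removes the direct dependence on input pixel scale.
W_check = W - eta * np.linalg.solve(A, X @ G.T).T
assert np.allclose(W_new, W_check)
```

In practice one would add a small ridge term to the Gram matrix for numerical stability when the batch statistics are degenerate; the sketch omits it for clarity.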