As models and datasets grow, training deep neural networks becomes a massive computational burden. One approach to speeding up training is Selective Backprop. In this approach, a forward pass is performed to obtain a loss value for each data point in a minibatch. The backward pass is then restricted to a subset of that minibatch, prioritizing high-loss examples. We build on this approach but seek to improve the subset selection mechanism by choosing the (weighted) subset that best matches the mean gradient over the entire minibatch. We use the gradients w.r.t. the model's last layer as a cheap proxy, resulting in virtually no overhead beyond the forward pass. In addition, our experiments include a simple random selection baseline, which has been absent from prior work. Surprisingly, we find that neither the loss-based nor the gradient-matching strategy consistently outperforms the random baseline.
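To make the three selection rules concrete, the following is a minimal NumPy sketch and not the implementation used in the experiments. It makes several assumptions: the last layer is a linear softmax classifier trained with cross-entropy, the proxy is the per-example gradient w.r.t. that layer's weight matrix, and gradient matching is approximated by a simple greedy search with uniform weights (the abstract allows a weighted subset and does not specify the solver). All function names are hypothetical.

```python
import numpy as np


def last_layer_grads(features, logits, labels):
    """Per-example cross-entropy loss and gradient w.r.t. the last layer's weights.

    Assumes logits = W @ h + b, so dL/dW = outer(softmax(logits) - onehot, h).
    """
    n = len(labels)
    shifted = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    probs = np.exp(shifted)
    probs /= probs.sum(axis=1, keepdims=True)
    losses = -np.log(probs[np.arange(n), labels])
    err = probs.copy()
    err[np.arange(n), labels] -= 1.0  # dL/dlogits = softmax - onehot
    # Outer product per example, flattened to one vector per data point.
    grads = (err[:, :, None] * features[:, None, :]).reshape(n, -1)
    return losses, grads


def select_high_loss(losses, k):
    """Loss-based rule: keep the k examples with the largest loss."""
    return np.argsort(-losses)[:k]


def select_random(n, k, rng):
    """Random baseline: keep k examples chosen uniformly without replacement."""
    return rng.choice(n, size=k, replace=False)


def select_gradient_matching(grads, k):
    """Greedy stand-in for gradient matching (uniform weights, an assumption).

    At each step, add the example that moves the subset's mean proxy gradient
    closest to the mean proxy gradient of the full minibatch.
    """
    target = grads.mean(axis=0)
    chosen, remaining = [], list(range(len(grads)))
    running_sum = np.zeros_like(target)
    for _ in range(k):
        candidate_means = (running_sum + grads[remaining]) / (len(chosen) + 1)
        dists = np.linalg.norm(candidate_means - target, axis=1)
        best = remaining[int(np.argmin(dists))]
        chosen.append(best)
        running_sum += grads[best]
        remaining.remove(best)
    return np.array(chosen)


def matching_error(grads, idx):
    """Distance between the subset's mean proxy gradient and the minibatch mean."""
    return float(np.linalg.norm(grads[idx].mean(axis=0) - grads.mean(axis=0)))
```

Given the `features` (inputs to the last layer), `logits`, and `labels` already produced by the forward pass, each selector returns the indices on which the backward pass would be run. Calling `matching_error` on those indices gives a quick way to compare how well each rule approximates the full-minibatch gradient under this proxy.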