In recent years, Dynamic Sparse Training (DST) has emerged as an alternative to post-training pruning for generating efficient models. In principle, DST allows for a more memory-efficient training process, as it maintains sparsity throughout the entire training run. However, current DST implementations fail to capitalize on this in practice: because sparse matrix multiplication is much less efficient than dense matrix multiplication on GPUs, most implementations simulate sparsity by masking weights. In this paper, we leverage recent advances in semi-structured sparse training to apply DST to classification with large output spaces, where memory efficiency is paramount. With a label space of possibly millions of candidates, the classification layer alone can consume several gigabytes of memory. Switching from a dense layer to a fixed fan-in sparse layer updated with sparse evolutionary training (SET), however, severely hampers training convergence, especially for the largest label spaces. We find that poor gradient flow from the sparse classifier to the dense text encoder makes it difficult to learn good input representations. By employing an intermediate layer or adding an auxiliary training objective, we recover most of the generalisation performance of the dense model. Overall, we demonstrate the applicability and practical benefits of DST in a challenging domain -- characterized by a highly skewed label distribution that differs substantially from typical DST benchmark datasets -- and enable end-to-end training with millions of labels on commodity hardware.