The success of deep ensembles in improving predictive performance, uncertainty estimation, and out-of-distribution robustness has been extensively studied in the machine learning literature. Despite these promising results, naively training multiple deep neural networks and combining their predictions at inference incurs prohibitive computational and memory costs. Recently proposed efficient ensemble approaches match the performance of traditional deep ensembles at significantly lower cost. However, the training resources required by these approaches are still at least those of training a single dense model. In this work, we draw a unique connection between sparse neural network training and deep ensembles, yielding a novel efficient ensemble learning framework called FreeTickets. Instead of training multiple dense networks and averaging their predictions, we directly train sparse subnetworks from scratch and extract diverse yet accurate subnetworks during this efficient sparse-to-sparse training. Our framework, FreeTickets, is defined as the ensemble of these relatively cheap sparse subnetworks. Despite being an ensemble method, FreeTickets requires even fewer parameters and training FLOPs than a single dense model. This seemingly counter-intuitive outcome stems from the high training and inference efficiency of dynamic sparse training. FreeTickets surpasses the dense baseline on all of the following criteria: prediction accuracy, uncertainty estimation, out-of-distribution (OoD) robustness, and efficiency for both training and inference. Impressively, FreeTickets outperforms the naive deep ensemble with ResNet50 on ImageNet while using only around 1/5 of the training FLOPs required by the latter. We have released our source code at https://github.com/VITA-Group/FreeTickets.
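The core idea above — collecting sparse subnetworks as free by-products of a single dynamic sparse training run and averaging their predictions — can be illustrated with a minimal toy sketch. This is not the paper's implementation (which applies dynamic sparse training to deep networks such as ResNet50); it is a deliberately simplified numpy-only linear model with a prune-and-regrow topology update, where each snapshot taken before a topology change plays the role of one "free ticket":

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy regression task: targets generated by a dense linear model.
X = rng.normal(size=(200, 20))
w_true = rng.normal(size=20)
y = X @ w_true

sparsity = 0.8                        # fraction of weights forced to zero
n_active = int(20 * (1 - sparsity))   # weights allowed to be nonzero

# Start from a random sparse topology (the initial mask).
mask = np.zeros(20)
mask[rng.choice(20, n_active, replace=False)] = 1.0
w = rng.normal(size=20) * 0.1

snapshots = []  # (weights, mask) pairs collected during one training run
lr = 0.01

for step in range(1, 601):
    # Gradient of mean squared error for the masked (sparse) model.
    grad = 2 * X.T @ (X @ (w * mask) - y) / len(X)
    w -= lr * grad
    w *= mask  # keep pruned weights exactly zero

    if step % 100 == 0:
        # Save the current sparse subnetwork as one ensemble member.
        snapshots.append((w.copy(), mask.copy()))
        # Prune-and-regrow topology update: drop the smallest-magnitude
        # active weight, activate a randomly chosen inactive one (regrown
        # weights restart at zero, as in RigL-style sparse training).
        active = np.flatnonzero(mask)
        inactive = np.flatnonzero(mask == 0)
        drop = active[np.argmin(np.abs(w[active]))]
        grow = rng.choice(inactive)
        mask[drop], mask[grow] = 0.0, 1.0
        w[drop] = 0.0

# Ensemble prediction: average the outputs of all collected subnetworks.
preds = np.mean([X @ (wi * mi) for wi, mi in snapshots], axis=0)
```

Each snapshot obeys the same sparsity budget, so the whole ensemble can stay cheaper than one dense model in both parameters and FLOPs; the topology perturbation between snapshots is what supplies the diversity that ensembling needs.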