Lottery tickets (LTs) are able to discover accurate and sparse subnetworks that can be trained in isolation to match the performance of dense networks. Ensembling, in parallel, is one of the oldest time-proven tricks in machine learning for improving performance by combining the outputs of multiple independent models. However, the benefits of ensembling in the context of LTs are diluted, since ensembling does not directly lead to stronger sparse subnetworks but only leverages their predictions for a better decision. In this work, we first observe that directly averaging the weights of adjacent learned subnetworks significantly boosts the performance of LTs. Encouraged by this observation, we further propose an alternative way to perform an 'ensemble' over the subnetworks identified by iterative magnitude pruning via a simple interpolating strategy. We call our method Lottery Pools. In contrast to the naive ensemble, which brings no performance gain to any single subnetwork, Lottery Pools yields much stronger sparse subnetworks than the original LTs without requiring any extra training or inference cost. Across various modern architectures on CIFAR-10/100 and ImageNet, we show that our method achieves significant performance gains in both in-distribution and out-of-distribution scenarios. Impressively, evaluated with VGG-16 and ResNet-18, the produced sparse subnetworks outperform the original LTs by up to 1.88% on CIFAR-100 and 2.36% on CIFAR-100-C; the resulting dense network surpasses the pre-trained dense model by up to 2.22% on CIFAR-100 and 2.38% on CIFAR-100-C.
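To make the idea of weight-space interpolation between adjacent subnetworks concrete, the following is a minimal sketch and not the authors' implementation. It assumes each subnetwork is stored as a mapping from layer name to a flat list of weights, that the sparser ticket's binary mask is re-applied after blending so the result keeps that ticket's sparsity, and that a single coefficient `alpha` controls the blend. The function name `interpolate_tickets`, the data layout, and the masking step are all hypothetical choices made for illustration.

```python
from typing import Dict, List

Weights = Dict[str, List[float]]  # layer name -> flattened weights (assumed layout)
Mask = Dict[str, List[int]]       # layer name -> 0/1 keep mask (assumed layout)


def interpolate_tickets(target: Weights, neighbor: Weights,
                        target_mask: Mask, alpha: float = 0.5) -> Weights:
    """Blend two subnetworks in weight space, then re-apply the target's mask.

    alpha = 1.0 returns the (masked) target unchanged; alpha = 0.5 is a
    plain average of the two sets of weights on the target's support.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    merged: Weights = {}
    for name, w_target in target.items():
        w_neighbor = neighbor[name]
        mask = target_mask[name]
        merged[name] = [
            # Pruned positions stay exactly zero so sparsity is preserved.
            (alpha * a + (1.0 - alpha) * b) if keep else 0.0
            for a, b, keep in zip(w_target, w_neighbor, mask)
        ]
    return merged


# Toy example: two "adjacent" tickets for one layer with four weights.
denser = {"fc": [0.5, -0.25, 1.0, 0.0]}   # ticket from the previous pruning round
sparser = {"fc": [1.0, 0.0, 0.5, 0.0]}    # ticket from the current pruning round
sparser_mask = {"fc": [1, 0, 1, 0]}       # weights the sparser ticket keeps

pooled = interpolate_tickets(sparser, denser, sparser_mask, alpha=0.5)
```

Because the blend happens once in weight space, the result is a single network with the same sparsity as the target ticket, so inference costs the same as for the original ticket. How `alpha` would be chosen in practice (for example, by checking held-out accuracy) is left open in this sketch.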