With sporadic exceptions, the recent trend in Computer Vision has been to achieve minor improvements at the cost of considerable increases in complexity. To reverse this trend, we propose a novel method that boosts image classification performance without increasing complexity. To this end, we revisited ensembling, a powerful approach that is often not used properly because of its added complexity and training time, and made it feasible through a specific design choice. First, we trained two EfficientNet-b0 end-to-end models (known to be the architecture with the best overall accuracy/complexity trade-off for image classification) on disjoint subsets of data (i.e. bagging). Then, we built an efficient adaptive ensemble by fine-tuning a trainable combination layer. In this way, we outperformed the state of the art by an average of 0.5$\%$ in accuracy on several major benchmark datasets, while reducing the number of parameters by 5-60 times and the floating-point operations (FLOPs) by 10-100 times.
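The two steps named in the abstract (training base models on disjoint subsets, then fitting a trainable combination layer) can be illustrated with a minimal, dependency-free sketch. This is not the authors' implementation: `disjoint_split` and `CombinationLayer` are hypothetical names, the base models are replaced by hand-written output vectors rather than EfficientNet-b0 networks, and the sketch assumes the base models stay frozen while only the combination layer is updated, which the abstract does not state explicitly.

```python
import math
import random


def disjoint_split(n_samples, n_parts=2, seed=0):
    """Shuffle sample indices and deal them into non-overlapping parts."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::n_parts] for i in range(n_parts)]


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


class CombinationLayer:
    """Linear layer over the concatenated outputs of the base models.

    Assumption: the base models are frozen, so only these weights train.
    """

    def __init__(self, in_dim, n_classes, seed=0):
        rng = random.Random(seed)
        self.w = [[rng.uniform(-0.1, 0.1) for _ in range(in_dim)]
                  for _ in range(n_classes)]
        self.b = [0.0] * n_classes

    def forward(self, x):
        logits = [sum(wi * xi for wi, xi in zip(row, x)) + bias
                  for row, bias in zip(self.w, self.b)]
        return softmax(logits)

    def sgd_step(self, x, label, lr=0.5):
        """One cross-entropy SGD update; returns the loss before the update."""
        probs = self.forward(x)
        for c, row in enumerate(self.w):
            grad = probs[c] - (1.0 if c == label else 0.0)
            for j, xj in enumerate(x):
                row[j] -= lr * grad * xj
            self.b[c] -= lr * grad
        return -math.log(probs[label])


# Step 1: each base model would be trained on its own disjoint subset.
subset_a, subset_b = disjoint_split(n_samples=10, n_parts=2, seed=1)

# Step 2: fit the combination layer on concatenated base-model outputs.
# Each feature vector is [model_a class scores..., model_b class scores...];
# the values are made up for illustration.
toy_data = [
    ([1.0, 0.0, 0.8, 0.2], 0),
    ([0.0, 1.0, 0.3, 0.7], 1),
    ([0.9, 0.1, 0.4, 0.6], 0),  # the two base models disagree here
    ([0.2, 0.8, 0.1, 0.9], 1),
]
layer = CombinationLayer(in_dim=4, n_classes=2)
for _ in range(200):
    for features, label in toy_data:
        layer.sgd_step(features, label)

for features, label in toy_data:
    probs = layer.forward(features)
    print(label, probs.index(max(probs)))
```

Because only the small combination layer is updated in this sketch, the extra training cost is limited to one linear layer, which matches the abstract's motivation for keeping the ensemble cheap. A real setup would presumably replace the hand-written vectors with the outputs of the two trained networks.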