Following the success of Vision Transformers, convolutional neural networks (CNNs) with large kernels have recently attracted much attention in computer vision. Large kernel CNNs have been reported to perform well not only in classification but also in downstream vision tasks. Their high performance in downstream tasks has been attributed to the large effective receptive field (ERF) produced by large kernels, but this view has not been fully tested. We therefore revisit the performance of large kernel CNNs in downstream tasks, focusing on weakly supervised object localization (WSOL). WSOL, a difficult downstream task that is not fully supervised, provides a new angle from which to explore the capabilities of large kernel CNNs. Our study compares the modern large kernel CNNs ConvNeXt, RepLKNet, and SLaK to test the validity of the naive expectation that ERF size is important for improving downstream task performance. Our analysis of the factors contributing to high performance offers a different perspective, in which the main factor is feature map improvement. Furthermore, we find that modern CNNs are robust to a problem of class activation mapping (CAM) that has long been discussed in WSOL, namely that only local regions of an object are activated. CAM is the most classic WSOL method, but because of this problem it is often used as a baseline method for comparison. However, experiments on the CUB-200-2011 dataset show that simply combining a large kernel CNN, CAM, and simple data augmentation methods achieves performance (90.99% MaxBoxAcc) comparable to that of the latest WSOL method, which is CNN-based and requires special training or complex post-processing. The code is available at https://github.com/snskysk/CAM-Back-Again.
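To make the CAM-based localization pipeline concrete, the following is a minimal, dependency-free sketch of how a class activation map can be turned into a bounding box and scored. It is an illustration under stated assumptions, not the code from the linked repository: the function names are hypothetical, the box is drawn around all above-threshold cells rather than the largest connected region, and MaxBoxAcc is assumed to mean box accuracy at IoU >= 0.5, maximized over the CAM threshold.

```python
from typing import List, Optional, Sequence, Tuple

Map2D = List[List[float]]
Box = Tuple[int, int, int, int]  # (x0, y0, x1, y1), inclusive cell coordinates


def class_activation_map(features: Sequence[Map2D], class_weights: Sequence[float]) -> Map2D:
    """Weighted sum of the final feature maps, min-max normalized to [0, 1]."""
    height, width = len(features[0]), len(features[0][0])
    cam = [[sum(w * fmap[y][x] for w, fmap in zip(class_weights, features))
            for x in range(width)] for y in range(height)]
    lo = min(min(row) for row in cam)
    hi = max(max(row) for row in cam)
    scale = (hi - lo) or 1.0  # avoid division by zero on a flat map
    return [[(v - lo) / scale for v in row] for row in cam]


def box_from_cam(cam: Map2D, threshold: float) -> Optional[Box]:
    """Tightest box around every cell with activation >= threshold (simplification)."""
    cells = [(x, y) for y, row in enumerate(cam) for x, v in enumerate(row) if v >= threshold]
    if not cells:
        return None
    xs, ys = zip(*cells)
    return min(xs), min(ys), max(xs), max(ys)


def _area(box: Box) -> int:
    return (box[2] - box[0] + 1) * (box[3] - box[1] + 1)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two inclusive-coordinate boxes."""
    iw = min(a[2], b[2]) - max(a[0], b[0]) + 1
    ih = min(a[3], b[3]) - max(a[1], b[1]) + 1
    inter = max(iw, 0) * max(ih, 0)
    return inter / (_area(a) + _area(b) - inter)


def max_box_acc(cams: Sequence[Map2D], gt_boxes: Sequence[Box],
                thresholds: Sequence[float], min_iou: float = 0.5) -> float:
    """Assumed metric: best fraction of images localized with IoU >= min_iou over thresholds."""
    best = 0.0
    for t in thresholds:
        hits = 0
        for cam, gt in zip(cams, gt_boxes):
            box = box_from_cam(cam, t)
            if box is not None and iou(box, gt) >= min_iou:
                hits += 1
        best = max(best, hits / len(cams))
    return best


# Toy example: two 3x3 feature maps and the classifier weights for one class.
features = [
    [[0, 0, 0], [0, 1, 1], [0, 1, 1]],  # channel responding to the object
    [[1, 0, 0], [0, 0, 0], [0, 0, 0]],  # channel responding to a background corner
]
weights = [1.0, 0.2]
cam = class_activation_map(features, weights)
predicted = box_from_cam(cam, threshold=0.5)
score = max_box_acc([cam], [(1, 1, 2, 2)], thresholds=[0.1, 0.5])
```

In this toy case a low threshold lets the weak background response stretch the box over the whole map, while a higher threshold isolates the object. That is why the metric is taken as a maximum over thresholds. A practical evaluation would also upsample the map to the image resolution before extracting the box.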