Dilated Convolution with Learnable Spacing (DCLS) is a recently proposed convolution method that, like dilated convolution, enlarges the receptive field (RF) without increasing the number of parameters, but without imposing a regular grid. DCLS has been shown to outperform standard and dilated convolutions on several computer vision benchmarks. Here, we show that DCLS also increases model interpretability, defined as alignment with human visual strategies. To quantify interpretability, we use the Spearman correlation between the models' Grad-CAM heatmaps and the ClickMe dataset heatmaps, which reflect human visual attention. We took eight reference models, namely ResNet50, ConvNeXt (T, S and B), CAFormer, ConvFormer, and FastViT (SA24 and SA36), and replaced their standard convolution layers with DCLS layers as a drop-in substitute. This improved the interpretability score for seven of the eight models. Moreover, we observed that for two models in our study, CAFormer and ConvFormer, Grad-CAM generated random heatmaps, leading to low interpretability scores. We addressed this issue by introducing Threshold-Grad-CAM, a modification built on top of Grad-CAM that improved interpretability for nearly all models. The code and checkpoints to reproduce this study are available at: https://github.com/rabihchamas/DCLS-GradCAM-Eval.
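The interpretability score described above combines two standard pieces: a Grad-CAM map (gradients of the class score, global-average-pooled per channel, used to weight the layer's feature maps, then passed through a ReLU) and a Spearman rank correlation against the human heatmap. The following is a minimal NumPy sketch of that computation, not the code in the linked repository. It assumes the layer activations and gradients have already been extracted from the model, and that the Grad-CAM map and the ClickMe heatmap have already been brought to the same resolution (resizing, averaging over images, and the Threshold-Grad-CAM variant are not shown). The function names `grad_cam`, `average_ranks` and `spearman` are hypothetical.

```python
import numpy as np


def grad_cam(activations: np.ndarray, gradients: np.ndarray) -> np.ndarray:
    """Standard Grad-CAM map for one layer.

    activations, gradients: arrays of shape (K, H, W) holding the feature maps
    A^k of the chosen layer and the gradients of the class score w.r.t. A^k.
    Returns an (H, W) map.
    """
    # Channel weights alpha_k: global average pooling of the gradients over space.
    alphas = gradients.mean(axis=(1, 2))
    # Weighted sum of the K feature maps, then ReLU to keep positive evidence only.
    return np.maximum(np.tensordot(alphas, activations, axes=1), 0.0)


def average_ranks(x: np.ndarray) -> np.ndarray:
    """1-based ranks of the flattened input, with ties given their average rank."""
    x = np.asarray(x, dtype=float).ravel()
    order = np.argsort(x, kind="mergesort")
    ranks = np.empty(x.size, dtype=float)
    ranks[order] = np.arange(1, x.size + 1)
    # Replace the ranks of tied values by the mean rank of their group.
    _, inverse = np.unique(x, return_inverse=True)
    inverse = inverse.ravel()
    sums = np.bincount(inverse, weights=ranks)
    counts = np.bincount(inverse)
    return (sums / counts)[inverse]


def spearman(model_map: np.ndarray, human_map: np.ndarray) -> float:
    """Spearman correlation = Pearson correlation of the pixel ranks."""
    if model_map.shape != human_map.shape:
        raise ValueError("heatmaps must have the same shape")
    ra = average_ranks(model_map)
    rb = average_ranks(human_map)
    ra -= ra.mean()
    rb -= rb.mean()
    denom = np.sqrt(np.sum(ra * ra) * np.sum(rb * rb))
    if denom == 0.0:
        # A constant map (e.g. an all-zero Grad-CAM) has no defined correlation.
        return float("nan")
    return float(np.sum(ra * rb) / denom)
```

Averaging tied ranks matters here because the ReLU in Grad-CAM typically zeroes out many pixels, producing large groups of tied values. If SciPy is available, `scipy.stats.spearmanr` on the flattened maps computes the same quantity.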