Most feedforward convolutional neural networks spend roughly the same effort on each pixel. Human visual recognition, by contrast, is an interaction between eye movements and spatial attention, in which we take several glimpses of an object in different regions. Inspired by this observation, we propose an end-to-end trainable Multi-Glimpse Network (MGNet), which uses a recurrent downsampled attention mechanism to tackle the challenges of high computation and lack of robustness. Specifically, MGNet sequentially selects task-relevant regions of an image to focus on and then adaptively combines all collected information for the final prediction. MGNet exhibits strong resistance to adversarial attacks and common corruptions while requiring less computation. MGNet is also inherently more interpretable, as it explicitly shows where it focuses during each iteration. Our experiments on ImageNet100 demonstrate the potential of recurrent downsampled attention mechanisms to improve on a single feedforward pass. For example, MGNet improves accuracy by 4.76% on average under common corruptions at only 36.9% of the computational cost. Moreover, while the baseline's accuracy drops to 7.6%, MGNet maintains 44.2% accuracy under the same PGD attack strength with a ResNet-50 backbone. Our code is available at https://github.com/siahuat0727/MGNet.
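To make the glimpse loop concrete, the following is a minimal pure-Python sketch of the general idea: crop a region, downsample it, process it, choose the next region, and combine the per-glimpse outputs. It is not the MGNet implementation. All names (`extract_glimpse`, `toy_backbone`, `multi_glimpse_predict`) are hypothetical, the "backbone" is a toy stand-in that returns the mean intensity and the brightest location, the next region is chosen by a hand-written rule rather than learned, and the plain average stands in for the adaptive combination described above.

```python
from typing import List, Tuple

Image = List[List[float]]
Region = Tuple[float, float, float]  # (center_x, center_y, scale), normalised to [0, 1]


def extract_glimpse(image: Image, region: Region, out_size: int) -> Image:
    """Crop the square `region` and downsample it to out_size x out_size
    using nearest-neighbour sampling (coordinates are clamped to the image)."""
    h, w = len(image), len(image[0])
    cx, cy, scale = region
    glimpse = []
    for i in range(out_size):
        row = []
        for j in range(out_size):
            # Map the centre of output pixel (i, j) to normalised image coordinates.
            u = cx + ((j + 0.5) / out_size - 0.5) * scale
            v = cy + ((i + 0.5) / out_size - 0.5) * scale
            x = min(max(int(u * w), 0), w - 1)
            y = min(max(int(v * h), 0), h - 1)
            row.append(image[y][x])
        glimpse.append(row)
    return glimpse


def toy_backbone(glimpse: Image) -> Tuple[float, Tuple[float, float]]:
    """Toy stand-in for a CNN (an assumption for illustration only):
    returns the mean intensity as a 'score' and the location of the
    brightest pixel in glimpse-relative coordinates (x, y) in [0, 1]."""
    n = len(glimpse)
    flat = [(val, i, j) for i, row in enumerate(glimpse) for j, val in enumerate(row)]
    score = sum(val for val, _, _ in flat) / len(flat)
    _, pi, pj = max(flat)
    return score, ((pj + 0.5) / n, (pi + 0.5) / n)


def multi_glimpse_predict(image: Image, num_glimpses: int = 3, out_size: int = 4):
    """Run a fixed number of glimpses, starting from the whole image."""
    region: Region = (0.5, 0.5, 1.0)  # first glimpse covers the full image
    regions, scores = [], []
    for _ in range(num_glimpses):
        glimpse = extract_glimpse(image, region, out_size)
        score, (px, py) = toy_backbone(glimpse)
        regions.append(region)
        scores.append(score)
        # Hand-written rule (not learned): recentre on the brightest
        # location and halve the field of view for the next glimpse.
        cx, cy, scale = region
        region = (cx + (px - 0.5) * scale, cy + (py - 0.5) * scale, scale * 0.5)
    # Plain average as a placeholder for an adaptive combination.
    prediction = sum(scores) / len(scores)
    return prediction, regions, scores
```

Because every glimpse is processed at the same small `out_size`, the per-glimpse cost in this sketch stays fixed regardless of the input resolution, and the returned `regions` list records where each glimpse looked.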