Reconstruction-guided attention improves the robustness and shape processing of neural networks

Many visual phenomena suggest that humans use top-down generative or reconstructive processes to create visual percepts (e.g., imagery, object completion, pareidolia), but little is known about the role reconstruction plays in robust object recognition. We built an iterative encoder-decoder network that generates an object reconstruction and uses it as top-down attentional feedback to route the most relevant spatial and feature information to feed-forward object recognition processes. We tested this model on MNIST-C, a challenging out-of-distribution digit recognition dataset in which 15 different types of transformation and corruption are applied to handwritten digit images. Our model showed strong generalization performance against various image perturbations, on average outperforming all other models, including feedforward CNNs and adversarially trained networks. Our model is particularly robust to blur, noise, and occlusion corruptions, where shape perception plays an important role. Ablation studies further reveal two complementary roles of spatial and feature-based attention in robust object recognition. The role of spatial attention is largely consistent with the spatial masking benefits reported in the attention literature (the reconstruction serves as a mask), whereas feature-based attention mainly contributes to the model's inference speed (i.e., the number of time steps needed to reach a given confidence threshold) by reducing the space of possible object hypotheses. We also observed that the model sometimes hallucinates a nonexistent pattern out of noise, leading to highly interpretable human-like errors. Our study shows that modeling reconstruction-based feedback endows AI systems with a powerful attention mechanism, which can help us understand the role of generative perception in human visual processing.
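The iterative loop described above can be illustrated with a toy sketch. This is not the authors' implementation: `toy_encode`, `toy_decode`, the two 3x3 templates, the 0.9 threshold, and the choice to decode from the predicted label are all assumptions made for illustration. The sketch covers only the spatial-masking role of the reconstruction and the confidence-based stopping rule; feature-based attention is omitted.

```python
import math
from typing import Callable, List, Tuple

Image = List[List[float]]

# Hypothetical class templates standing in for learned object shapes.
TEMPLATES: List[Image] = [
    [[0, 1, 0], [0, 1, 0], [0, 1, 0]],  # class 0: vertical bar
    [[0, 0, 0], [1, 1, 1], [0, 0, 0]],  # class 1: horizontal bar
]

def softmax(logits: List[float]) -> List[float]:
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def toy_encode(image: Image) -> List[float]:
    """Toy stand-in for the encoder: reward pixels on a template, penalize pixels off it."""
    return [
        sum(p * (2 * t - 1) for row, trow in zip(image, tpl) for p, t in zip(row, trow))
        for tpl in TEMPLATES
    ]

def toy_decode(label: int) -> Image:
    """Toy stand-in for the decoder: reconstruct the shape of the current hypothesis."""
    return TEMPLATES[label]

def apply_spatial_mask(image: Image, reconstruction: Image) -> Image:
    """Gate the input element-wise so only pixels supported by the reconstruction pass."""
    return [[p * r for p, r in zip(irow, rrow)] for irow, rrow in zip(image, reconstruction)]

def recognize(
    image: Image,
    encode: Callable[[Image], List[float]],
    decode: Callable[[int], Image],
    threshold: float = 0.9,
    max_steps: int = 5,
) -> Tuple[int, float, int]:
    """Iterate encode -> reconstruct -> mask until confidence reaches the threshold."""
    attended = image
    label, confidence, step = -1, 0.0, 0
    for step in range(1, max_steps + 1):
        probs = softmax(encode(attended))
        label = max(range(len(probs)), key=probs.__getitem__)
        confidence = probs[label]
        if confidence >= threshold:
            break
        # Top-down feedback: the reconstruction masks the original input.
        attended = apply_spatial_mask(image, decode(label))
    return label, confidence, step

CLEAN: Image = [[0, 1, 0], [0, 1, 0], [0, 1, 0]]
NOISY: Image = [[0.6, 1, 0], [0.7, 1, 0.7], [0, 1, 0.6]]  # vertical bar plus clutter

clean_result = recognize(CLEAN, toy_encode, toy_decode)
noisy_result = recognize(NOISY, toy_encode, toy_decode)
print(clean_result, noisy_result)
```

In this toy setting the clean input crosses the threshold on the first pass, while the cluttered input needs one round of reconstruction-based masking before the same hypothesis becomes confident, which mirrors the "number of time steps to reach a confidence threshold" measure mentioned in the abstract.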
