Vision Transformers (ViTs) have demonstrated remarkable performance in image classification tasks, particularly when equipped with local information via region attention or convolutions. While such architectures improve feature aggregation across granularities, they often fail to contribute to the robustness of the network. Neural Cellular Automata (NCA) enables the modeling of global cell representations through local interactions, with its training strategies and architecture design conferring strong generalization ability and robustness against noisy input. In this paper, we propose Adaptor Neural Cellular Automata (AdaNCA) for Vision Transformers, which uses NCA as plug-and-play adaptors between ViT layers, enhancing ViTs' performance and robustness against adversarial samples as well as out-of-distribution inputs. To overcome the large computational overhead of standard NCAs, we propose Dynamic Interaction for more efficient interaction learning. Furthermore, we develop an algorithm for identifying the most effective insertion points for AdaNCA, based on our analysis of AdaNCA placement and robustness improvement. With less than a 3% increase in parameters, AdaNCA contributes to more than a 10% absolute improvement in accuracy under adversarial attacks on the ImageNet1K benchmark. Moreover, we demonstrate with extensive evaluations across 8 robustness benchmarks and 4 ViT architectures that AdaNCA, as a plug-and-play module, consistently improves the robustness of ViTs.
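To make the core idea concrete, the following is a minimal NumPy sketch of an NCA-style adaptor applied to a grid of ViT token features. This is a hypothetical illustration, not the paper's exact AdaNCA or Dynamic Interaction design: the function name `nca_adaptor`, the 3x3 mean/max perception, and the single linear update rule are all assumptions chosen for brevity. It only shows the NCA ingredients named in the abstract: local interaction, a learned residual update, and stochastic (asynchronous) cell firing.

```python
import numpy as np

def nca_adaptor(tokens, w_update, steps=3, fire_rate=0.5, seed=0):
    """Hypothetical NCA-style adaptor sketch (not the paper's exact design).

    tokens:   (H, W, C) grid of ViT token features.
    w_update: (3*C, C) weights of a linear update rule.

    Each step: perceive the 3x3 neighborhood, compute a residual update,
    and apply it to a random subset of cells (asynchronous updates).
    """
    rng = np.random.default_rng(seed)
    x = tokens.copy()
    H, W, C = x.shape
    for _ in range(steps):
        # Perception: each cell sees itself plus the mean and max of its
        # 3x3 neighborhood (toroidal padding keeps the grid size fixed).
        padded = np.pad(x, ((1, 1), (1, 1), (0, 0)), mode="wrap")
        neigh = np.stack([padded[i:i + H, j:j + W]
                          for i in range(3) for j in range(3)], axis=0)
        percept = np.concatenate([x, neigh.mean(0), neigh.max(0)], axis=-1)
        delta = np.tanh(percept @ w_update)        # (H, W, C) residual update
        # Stochastic update: only a random subset of cells fires each step.
        mask = (rng.random((H, W, 1)) < fire_rate).astype(x.dtype)
        x = x + mask * delta
    return x
```

Because the update is residual and shape-preserving, such a module can in principle be inserted between any two ViT layers without changing the surrounding architecture, which is what makes the adaptor plug-and-play.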