Vision Transformers (ViTs) have demonstrated remarkable performance in image classification tasks, particularly when equipped with local information via region attention or convolutions. While such architectures improve feature aggregation across different granularities, they often fail to improve the robustness of the network. Neural Cellular Automata (NCA) model global cell representations through local interactions, and their training strategies and architecture design confer strong generalization ability and robustness against noisy inputs. In this paper, we propose Adaptor Neural Cellular Automata (AdaNCA) for Vision Transformers, which uses NCA as plug-and-play adaptors between ViT layers, enhancing ViT's performance and robustness against adversarial samples as well as out-of-distribution inputs. To overcome the large computational overhead of standard NCAs, we propose Dynamic Interaction for more efficient interaction learning. Furthermore, we develop an algorithm for identifying the most effective insertion points for AdaNCA, based on our analysis of how AdaNCA placement relates to robustness improvement. With less than a 3% increase in parameters, AdaNCA contributes to more than 10% absolute improvement in accuracy under adversarial attacks on the ImageNet1K benchmark. Moreover, extensive evaluations across 8 robustness benchmarks and 4 ViT architectures demonstrate that AdaNCA, as a plug-and-play module, consistently improves the robustness of ViTs.
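To make the NCA idea referenced above concrete, the following is a minimal sketch of a single NCA update step: each cell perceives its local 3x3 neighbourhood, runs a small shared MLP, and is updated stochastically (only a random subset of cells "fires" each step). All names, shapes, and the zero-initialized output layer here are illustrative assumptions; this is a generic NCA step in NumPy, not the paper's AdaNCA or its Dynamic Interaction mechanism.

```python
import numpy as np

rng = np.random.default_rng(0)

def nca_step(cells, w1, b1, w2, fire_rate=0.5):
    """One generic NCA update (illustrative, not AdaNCA itself):
    perceive the 3x3 neighbourhood, apply a shared MLP, update a
    random subset of cells, and add the result residually."""
    H, W, C = cells.shape
    # Toroidal padding so every cell has a full 3x3 neighbourhood.
    padded = np.pad(cells, ((1, 1), (1, 1), (0, 0)), mode="wrap")
    # Perception: concatenate the 9 neighbouring cell states -> (H, W, 9*C).
    perc = np.concatenate(
        [padded[i:i + H, j:j + W] for i in range(3) for j in range(3)],
        axis=-1)
    hidden = np.maximum(perc @ w1 + b1, 0.0)   # shared ReLU MLP per cell
    delta = hidden @ w2                        # proposed state change (H, W, C)
    # Stochastic update: each cell fires independently with prob. fire_rate.
    mask = rng.random((H, W, 1)) < fire_rate
    return cells + delta * mask

# Toy dimensions (assumed): an 8x8 grid of 4-channel cells, hidden width 16.
C, Hd = 4, 16
cells = rng.standard_normal((8, 8, C))
w1 = rng.standard_normal((9 * C, Hd)) * 0.1
b1 = np.zeros(Hd)
w2 = np.zeros((Hd, C))  # zero-init output layer: the first step is an identity map
out = nca_step(cells, w1, b1, w2)
```

Iterating `nca_step` lets purely local interactions propagate information across the whole grid, which is the property the abstract credits for NCA's robustness; the stochastic firing mask is one source of the noise tolerance acquired during training.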