Exploring Plain ViT Reconstruction for Multi-class Unsupervised Anomaly Detection

This work studies the recently proposed Multi-class Unsupervised Anomaly Detection (MUAD) task, which is both challenging and practical: training requires only normal images, while testing covers both normal and anomalous images across multiple classes. Existing reconstruction-based methods typically adopt pyramid networks as encoders and decoders to obtain multi-resolution features, accompanied by elaborate, heavily hand-engineered sub-modules for more precise localization. In contrast, a plain Vision Transformer (ViT) with a simple architecture has proven effective across multiple domains while remaining simple and elegant. Following this spirit, this paper explores the plain ViT architecture for MUAD. Specifically, we abstract a Meta-AD concept by generalizing from current reconstruction-based methods. We then instantiate ViTAD, a novel and elegant symmetric structure built on plain ViT, designed step by step from three macro and four micro perspectives. In addition, this paper reports several interesting findings for further exploration. Finally, we propose a comprehensive and fair evaluation benchmark with eight metrics for the MUAD task. With a naive training recipe and without bells and whistles, ViTAD achieves state-of-the-art (SoTA) results and efficiency on the MVTec AD and VisA datasets: it obtains 85.4 mAD, surpassing the SoTA method UniAD by +3.0, and requires only 1.1 hours and 2.3 GB of GPU memory to train on a single V100 GPU. Source code, models, and more results are available at https://zhangzjn.github.io/projects/ViTAD.
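To make the reconstruction-based setting concrete, the following is a minimal sketch of how an anomaly map can be derived by comparing a target feature map with its reconstruction. It is an illustration under stated assumptions, not the ViTAD implementation: the cosine-distance score, the max-pooled image-level score, the `(C, H, W)` feature layout, and the function names `cosine_anomaly_map` and `image_score` are all choices made here for the example, and random arrays stand in for real encoder and decoder outputs.

```python
import numpy as np

def cosine_anomaly_map(target_feat, recon_feat, eps=1e-8):
    """Per-location score: 1 - cosine similarity along the channel axis.

    Both inputs are assumed to have shape (C, H, W); the output has shape (H, W).
    Scores lie in [0, 2], with higher values meaning a poorer reconstruction.
    """
    dot = (target_feat * recon_feat).sum(axis=0)
    norm = np.linalg.norm(target_feat, axis=0) * np.linalg.norm(recon_feat, axis=0)
    return 1.0 - dot / (norm + eps)

def image_score(anomaly_map):
    """Image-level score, taken here (as an assumption) to be the map maximum."""
    return float(anomaly_map.max())

# Stand-in features: a 16-channel map over a 4x4 grid of patches.
rng = np.random.default_rng(0)
target = rng.normal(size=(16, 4, 4))
recon = target.copy()
# Simulate a failed reconstruction at a single patch location.
recon[:, 1, 2] = -target[:, 1, 2]

amap = cosine_anomaly_map(target, recon)
peak = np.unravel_index(amap.argmax(), amap.shape)
score = image_score(amap)
```

In this toy setup the score is near zero wherever the reconstruction matches the target and peaks at the single perturbed patch, which is the behavior a reconstruction-based detector relies on when it has been trained on normal images only.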
