This paper presents CLUSTERFORMER, a universal vision model based on the CLUSTERing paradigm with TransFORMER. It comprises two novel designs: (1) recurrent cross-attention clustering, which reformulates the cross-attention mechanism in Transformer and enables recursive updates of cluster centers to facilitate strong representation learning; and (2) feature dispatching, which uses the updated cluster centers to redistribute image features through similarity-based metrics, resulting in a transparent pipeline. This elegant design yields an explainable and transferable workflow capable of tackling heterogeneous vision tasks (i.e., image classification, object detection, and image segmentation) with varying levels of clustering granularity (i.e., image-, box-, and pixel-level). Empirical results demonstrate that CLUSTERFORMER outperforms various well-known specialized architectures, achieving 83.41% top-1 accuracy on ImageNet-1K for image classification, 54.2% and 47.0% mAP on MS COCO for object detection and instance segmentation, respectively, 52.4% mIoU on ADE20K for semantic segmentation, and 55.8% PQ on COCO Panoptic for panoptic segmentation. Given its efficacy, we hope our work can catalyze a paradigm shift in universal models in computer vision.
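To make the two designs concrete, the following is a minimal sketch in plain Python, not the authors' implementation. It assumes dot-product similarity, a softmax taken over the cluster axis, weighted-mean center updates, and a residual dispatch step; the function names (`assign`, `recurrent_clustering`, `dispatch`), the toy features, and the step count are all hypothetical choices made for illustration. The actual model presumably uses learned projections and further layers that are omitted here.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def softmax(scores):
    # Subtract the maximum for numerical stability.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def assign(features, centers):
    """Soft-assign each feature to the centers (softmax over the cluster axis)."""
    return [softmax([dot(x, c) for c in centers]) for x in features]


def recurrent_clustering(features, centers, steps=3):
    """Assumed update rule: alternately re-assign features and recompute centers."""
    dim = len(features[0])
    for _ in range(steps):
        weights = assign(features, centers)  # weights[n][k]
        new_centers = []
        for k in range(len(centers)):
            mass = sum(w[k] for w in weights)  # always > 0 under softmax
            new_centers.append([
                sum(w[k] * x[d] for w, x in zip(weights, features)) / mass
                for d in range(dim)
            ])
        centers = new_centers
    return centers


def dispatch(features, centers):
    """Assumed dispatch rule: add a similarity-weighted mix of centers to each feature."""
    weights = assign(features, centers)
    out = []
    for x, w in zip(features, weights):
        mix = [
            sum(w[k] * centers[k][d] for k in range(len(centers)))
            for d in range(len(x))
        ]
        out.append([a + b for a, b in zip(x, mix)])
    return out


# Toy data: two features near the x-axis, two near the y-axis (hypothetical).
features = [[4.0, 0.0], [5.0, 0.5], [0.0, 4.0], [0.5, 5.0]]
init_centers = [[1.0, 0.0], [0.0, 1.0]]

centers = recurrent_clustering(features, init_centers, steps=3)
updated = dispatch(features, centers)
```

In this sketch the softmax runs over clusters rather than over features, so each feature distributes a unit of assignment mass across the centers. That choice is what makes the loop resemble iterative clustering rather than standard cross-attention, and the assignment weights can be inspected directly, which is one way to read the "transparent pipeline" claim.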