This paper presents CLUSTERFORMER, a universal vision model based on the CLUSTERing paradigm with TransFORMER. It comprises two novel designs: (1) recurrent cross-attention clustering, which reformulates the cross-attention mechanism in Transformer and enables recursive updates of cluster centers to facilitate strong representation learning; and (2) feature dispatching, which uses the updated cluster centers to redistribute image features through similarity-based metrics, resulting in a transparent pipeline. This design yields an explainable and transferable workflow capable of tackling heterogeneous vision tasks (i.e., image classification, object detection, and image segmentation) with varying levels of clustering granularity (i.e., image-, box-, and pixel-level). Empirical results demonstrate that CLUSTERFORMER outperforms various well-known specialized architectures, achieving 83.41% top-1 accuracy on ImageNet-1K for image classification, 54.2% and 47.0% mAP on MS COCO for object detection and instance segmentation, respectively, 52.4% mIoU on ADE20K for semantic segmentation, and 55.8% PQ on COCO Panoptic for panoptic segmentation. Given its efficacy, we hope our work can catalyze a paradigm shift in universal models in computer vision.
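The two designs named above can be illustrated with a minimal NumPy sketch. It is not the paper's implementation: the abstract gives no formulas, so the sketch assumes that cluster centers act as queries over image features, that the softmax is taken over the cluster axis so that features are softly assigned to centers, and that dispatching adds a similarity-weighted mix of centers back to each feature. The function names, the scaling factor, and the residual update are all hypothetical choices made for illustration.

```python
import numpy as np


def softmax(x, axis):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def recurrent_cluster_centers(features, centers, num_iters=3):
    """Recursively update cluster centers (assumed soft-assignment scheme).

    features: (N, D) array of image features.
    centers:  (K, D) array of initial cluster centers.
    """
    scale = features.shape[1] ** -0.5
    for _ in range(num_iters):
        # Similarity between every center (query) and every feature (key).
        logits = (centers @ features.T) * scale          # (K, N)
        # Assumption: normalise over the cluster axis, so each feature
        # spreads one unit of assignment mass across the K centers.
        assign = softmax(logits, axis=0)                 # (K, N)
        # Each new center is the assignment-weighted mean of the features.
        weights = assign / (assign.sum(axis=1, keepdims=True) + 1e-8)
        centers = weights @ features                     # (K, D)
    return centers


def dispatch_features(features, centers):
    """Redistribute features using similarity to the updated centers.

    Assumption: a residual update with softmax-normalised similarities.
    """
    scale = features.shape[1] ** -0.5
    sim = softmax((features @ centers.T) * scale, axis=1)  # (N, K)
    return features + sim @ centers                        # (N, D)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    feats = rng.normal(size=(64, 16))   # 64 toy "pixel" features
    init = feats[:4].copy()             # 4 hypothetical initial centers
    updated = recurrent_cluster_centers(feats, init, num_iters=3)
    dispatched = dispatch_features(feats, updated)
    print(updated.shape, dispatched.shape)
```

Under these assumptions the center update resembles a soft k-means step, and the number of centers stands in for the clustering granularity that the abstract varies across tasks.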