We present a mask-piloted Transformer that improves the masked-attention in Mask2Former for image segmentation. The improvement is motivated by our observation that Mask2Former suffers from inconsistent mask predictions between consecutive decoder layers, which leads to inconsistent optimization goals and low utilization of decoder queries. To address this problem, we propose a mask-piloted training approach, which additionally feeds noised ground-truth masks into masked-attention and trains the model to reconstruct the original ones. Compared with the predicted masks used in masked-attention, the ground-truth masks serve as a pilot and effectively alleviate the negative impact of inaccurate mask predictions in Mask2Former. Based on this technique, our \M achieves a remarkable performance improvement on all three image segmentation tasks (instance, panoptic, and semantic), yielding $+2.3$ AP and $+1.6$ mIoU on the Cityscapes instance and semantic segmentation tasks with a ResNet-50 backbone. Our method also significantly speeds up training, outperforming Mask2Former with half the number of training epochs on ADE20K with both ResNet-50 and Swin-L backbones. Moreover, our method introduces only a small amount of computation during training and no extra computation during inference. Our code will be released at \url{https://github.com/IDEA-Research/MP-Former}.
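The mask-piloted idea described above can be illustrated with a minimal sketch in plain Python. It is not the released implementation. The noise model (independent pixel flips with probability `flip_prob`), the function names `noise_mask` and `to_attention_mask`, and the fallback for a fully blocked mask are all assumptions made for illustration; the abstract states only that noised ground-truth masks are fed into masked-attention and that the model is trained to reconstruct the original masks.

```python
import random


def noise_mask(gt_mask, flip_prob, rng):
    """Return a noised copy of a binary ground-truth mask.

    Assumed noise model: each pixel is flipped independently with
    probability flip_prob. The abstract does not specify the noise type.
    """
    return [[1 - p if rng.random() < flip_prob else p for p in row]
            for row in gt_mask]


def to_attention_mask(mask):
    """Flatten a binary mask into a per-pixel attention mask.

    True means the query may NOT attend to that pixel (background).
    Assumed safeguard: if every pixel would be blocked, allow attention
    everywhere so the query still receives some input.
    """
    blocked = [p == 0 for row in mask for p in row]
    if all(blocked):
        return [False] * len(blocked)
    return blocked


# Hypothetical 4x4 ground-truth mask for one object.
gt = [
    [0, 0, 0, 0],
    [0, 1, 1, 0],
    [0, 1, 1, 0],
    [0, 0, 0, 0],
]

rng = random.Random(0)                       # fixed seed for repeatability
piloted = noise_mask(gt, flip_prob=0.2, rng=rng)
attn_mask = to_attention_mask(piloted)       # fed to masked-attention
target = gt                                  # reconstruction target stays clean
```

In this sketch the noised mask only restricts where a query may attend, while the clean mask `gt` remains the training target, which matches the reconstruction objective stated in the abstract. Because the extra masks are used only as a training signal, nothing in the sketch would need to run at inference time.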