Amodal Instance Segmentation (AIS) aims to segment both the visible and the possibly occluded parts of an object instance. While Mask R-CNN-based AIS approaches have shown promising results, they cannot model the coherence of high-level features because of their limited receptive field. Recent Transformer-based models show impressive performance on vision tasks, even surpassing Convolutional Neural Networks (CNNs). In this work, we present AISFormer, an AIS framework with a Transformer-based mask head. AISFormer explicitly models the complex coherence between the occluder, visible, amodal, and invisible masks within an object's region of interest (ROI) by treating them as learnable queries. Specifically, AISFormer contains four modules: (i) feature encoding, which extracts the ROI and learns both short-range and long-range visual features; (ii) mask transformer decoding, which generates the occluder, visible, and amodal mask query embeddings with a transformer decoder; (iii) invisible mask embedding, which models the coherence between the amodal and visible masks; and (iv) mask predicting, which estimates the output masks: occluder, visible, amodal, and invisible. We conduct extensive experiments and ablation studies on three challenging benchmarks, namely KINS, D2SA, and COCOA-cls, to evaluate the effectiveness of AISFormer. The code is available at: https://github.com/UARK-AICV/AISFormer
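The relationship among the output masks named above can be illustrated on a toy ROI. The sketch below is not AISFormer's mask head, which learns these masks from query embeddings. It only shows how the targets relate, assuming binary masks and the usual reading of the invisible region as the amodal region minus the visible region. The grid values and the helper names (`mask_and_not`, `mask_or`, `area`) are hypothetical and chosen for illustration.

```python
# Toy 4x6 ROI; 1 means the pixel belongs to the mask.
# All values are made up for illustration (assumption, not data from the paper).
AMODAL = [
    [0, 0, 0, 0, 0, 0],
    [0, 1, 1, 1, 1, 0],
    [0, 1, 1, 1, 1, 0],
    [0, 0, 0, 0, 0, 0],
]

# A hypothetical occluding object covering the right half of the ROI.
OCCLUDER = [
    [0, 0, 0, 1, 1, 1],
    [0, 0, 0, 1, 1, 1],
    [0, 0, 0, 1, 1, 1],
    [0, 0, 0, 1, 1, 1],
]


def mask_and_not(a, b):
    """Pixels that are set in mask a but not in mask b."""
    return [[int(x and not y) for x, y in zip(row_a, row_b)]
            for row_a, row_b in zip(a, b)]


def mask_or(a, b):
    """Pixels that are set in mask a or in mask b."""
    return [[int(x or y) for x, y in zip(row_a, row_b)]
            for row_a, row_b in zip(a, b)]


def area(mask):
    """Number of set pixels in a mask."""
    return sum(sum(row) for row in mask)


# Visible part: the part of the full object extent not hidden by the occluder.
visible = mask_and_not(AMODAL, OCCLUDER)

# Invisible part: the amodal region minus the visible region.
invisible = mask_and_not(AMODAL, visible)

print("amodal area:   ", area(AMODAL))
print("visible area:  ", area(visible))
print("invisible area:", area(invisible))
```

In this toy setting the visible and invisible masks are disjoint and together recover the amodal mask, which is the kind of coherence between masks that a learned mask head has to respect.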