We propose Masked-Attention Transformers for Surgical Instrument Segmentation (MATIS), a two-stage, fully transformer-based method that leverages modern pixel-wise attention mechanisms for instrument segmentation. MATIS exploits the instance-level nature of the task by employing a masked attention module that generates and classifies a set of fine instrument region proposals. Our method incorporates long-term video-level information through video transformers to improve temporal consistency and enhance mask classification. We validate our approach on the two standard public benchmarks, EndoVis 2017 and EndoVis 2018. Our experiments demonstrate that the per-frame baseline of MATIS outperforms previous state-of-the-art methods and that adding our temporal consistency module improves performance further.
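To make the idea of masked attention concrete, the following is a minimal NumPy sketch of a single masked cross-attention step, in which each region query attends only to the pixels inside its current mask estimate. It illustrates the general technique only and is not the MATIS implementation: the function name `masked_attention`, the argument names, the toy shapes, and the fallback for empty masks are all assumptions made for this example.

```python
import numpy as np


def masked_attention(queries, keys, values, region_mask):
    """Single-head cross-attention restricted by a per-query boolean mask.

    queries:     (num_queries, dim) region-proposal embeddings (hypothetical)
    keys/values: (num_pixels, dim) flattened pixel features (hypothetical)
    region_mask: (num_queries, num_pixels) bool, True where a query may attend
    """
    dim = queries.shape[-1]
    logits = queries @ keys.T / np.sqrt(dim)

    # Assumption: a query with an empty mask falls back to attending
    # everywhere, so that its softmax is still well defined.
    empty = ~region_mask.any(axis=1, keepdims=True)
    allowed = region_mask | empty

    # Disallowed pixels get -inf so they receive zero weight after softmax.
    logits = np.where(allowed, logits, -np.inf)
    logits = logits - logits.max(axis=1, keepdims=True)
    weights = np.exp(logits)
    weights = weights / weights.sum(axis=1, keepdims=True)
    return weights @ values, weights


# Toy example: 2 queries, 6 pixels, feature dimension 4.
rng = np.random.default_rng(0)
q = rng.normal(size=(2, 4))
k = rng.normal(size=(6, 4))
v = rng.normal(size=(6, 4))
mask = np.zeros((2, 6), dtype=bool)
mask[0, :3] = True  # query 0 may only look at the first three pixels
out, w = masked_attention(q, k, v, mask)
```

In this toy run, query 0 places all of its attention weight on the first three pixels, while query 1, whose mask is empty, attends over the whole feature map. In a full model the mask would typically come from the previous layer's mask prediction rather than being set by hand.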