This paper presents SegmATRon, an adaptive transformer model for embodied image semantic segmentation. Its distinctive feature is that the model weights are adapted during inference on several images using a hybrid multicomponent loss function. We studied this model on datasets collected in the photorealistic Habitat simulator and the synthetic AI2-THOR simulator. We showed that obtaining additional images through the agent's actions in an indoor environment can improve the quality of semantic segmentation. The code of the proposed approach and the datasets are publicly available at https://github.com/wingrune/SegmATRon.
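To make the idea of adapting weights during inference on several images concrete, the following is a minimal toy sketch, not the SegmATRon implementation. Everything in it is an assumption made for illustration: the "model" is a per-class linear scorer over scalar pixel features, the two loss components (prediction entropy and cross-frame consistency) are stand-ins for the hybrid multicomponent loss, whose actual components the abstract does not specify, and all function names and the weights `alpha` and `beta` are hypothetical. Gradients are estimated by finite differences so that the example needs only the standard library.

```python
import math

def softmax(logits):
    # Numerically stable softmax over a list of class logits.
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def predict(params, frame):
    # Toy "segmentation model": params = [w0, b0, w1, b1, ...],
    # one (scale, bias) pair per class; frame is a list of scalar pixel features.
    pairs = list(zip(params[0::2], params[1::2]))
    return [softmax([w * x + b for w, b in pairs]) for x in frame]

def entropy_term(params, frames):
    # Stand-in component 1 (assumption): mean prediction entropy over all pixels.
    probs = [p for frame in frames for p in predict(params, frame)]
    return -sum(q * math.log(q) for p in probs for q in p) / len(probs)

def consistency_term(params, frames):
    # Stand-in component 2 (assumption): squared distance between the mean class
    # distribution of the first frame and that of each additional frame.
    def mean_dist(frame):
        probs = predict(params, frame)
        return [sum(col) / len(probs) for col in zip(*probs)]
    extra = frames[1:]
    if not extra:
        return 0.0
    ref = mean_dist(frames[0])
    return sum(
        sum((a - b) ** 2 for a, b in zip(ref, mean_dist(f))) for f in extra
    ) / len(extra)

def hybrid_loss(params, frames, alpha=1.0, beta=0.5):
    # Weighted sum of the two stand-in components (weights are hypothetical).
    return alpha * entropy_term(params, frames) + beta * consistency_term(params, frames)

def adapt(params, frames, lr=0.1, eps=1e-5):
    # One inference-time gradient step on the hybrid loss, using central
    # finite differences; returns new parameters and leaves the input untouched.
    grads = []
    for i in range(len(params)):
        up, down = list(params), list(params)
        up[i] += eps
        down[i] -= eps
        grads.append((hybrid_loss(up, frames) - hybrid_loss(down, frames)) / (2 * eps))
    return [p - lr * g for p, g in zip(params, grads)]

# Usage: the first frame is the one to segment; the second stands in for an
# additional view obtained through an agent action.
params = [1.0, 0.0, -1.0, 0.0]
frames = [[0.2, -0.1, 0.4], [0.3, 0.1, 0.5]]
adapted = adapt(params, frames)
final_prediction = predict(adapted, frames[0])
```

In this sketch the additional frame only influences the result through the loss used for the weight update; the final prediction is still made on the first frame, mirroring the setting described above in which extra images serve to adapt the model rather than being segmented themselves.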