Semantic segmentation networks, which are essential for robotic perception, often suffer from performance degradation when the visual distribution of the deployment environment differs from that of the source dataset on which they were trained. Unsupervised Domain Adaptation (UDA) addresses this challenge by adapting the network to the robot's target environment without external supervision, leveraging the large amounts of data a robot might naturally collect during long-term operation. In such settings, UDA methods can exploit multi-view consistency across the environment's map to fine-tune the model in an unsupervised fashion and mitigate domain shift. However, these approaches remain sensitive to cross-view instance-level inconsistencies. In this work, we propose a method that starts from a volumetric 3D map to generate multi-view consistent pseudo-labels. We then refine these labels using the zero-shot instance segmentation capabilities of a foundation model, enforcing instance-level coherence. The refined annotations serve as supervision for self-supervised fine-tuning, enabling the robot to adapt its perception system at deployment time. Experiments on real-world data demonstrate that our approach consistently improves performance over state-of-the-art UDA baselines based on multi-view consistency, without requiring any ground-truth labels in the target domain.