DA-STC: Domain Adaptive Video Semantic Segmentation via Spatio-Temporal Consistency

Video semantic segmentation is a pivotal aspect of video representation learning. However, significant domain shifts make it difficult to learn spatio-temporal features that are invariant across the labeled source domain and the unlabeled target domain. To address this challenge, we propose DA-STC, a novel method for domain adaptive video semantic segmentation that incorporates a bidirectional multi-level spatio-temporal fusion module and a category-aware spatio-temporal feature alignment module to facilitate consistent learning of domain-invariant features. First, we perform bidirectional spatio-temporal fusion at the image sequence level and the shallow feature level, which constructs two fused intermediate video domains. This prompts the video semantic segmentation model to consistently learn spatio-temporal features of shared patch sequences that are influenced by domain-specific contexts, thereby mitigating the feature gap between the source and target domains. Second, we propose a category-aware feature alignment module to promote the consistency of spatio-temporal features, facilitating adaptation to the target domain. Specifically, we adaptively aggregate the domain-specific deep features of each category along the spatio-temporal dimensions, and further constrain the aggregated features to achieve cross-domain intra-class feature alignment and inter-class feature separation. Extensive experiments demonstrate the effectiveness of our method, which achieves state-of-the-art mIoU on multiple challenging benchmarks. Furthermore, we extend the proposed DA-STC to the image domain, where it also exhibits superior performance for domain adaptive semantic segmentation. The source code and models will be made available at \url{https://github.com/ZHE-SAPI/DA-STC}.
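The two ideas above can be illustrated with a minimal NumPy sketch. It is not the released DA-STC implementation: the function names, the rectangular patch, the plain per-class mean (used here in place of the adaptive aggregation), the Euclidean distance, and the margin value are all assumptions made for illustration. Labels are assumed to lie in [0, num_classes), and target-domain labels would in practice have to be pseudo-labels.

```python
import numpy as np

def mix_patch_sequence(src, tgt, box):
    """Sequence-level bidirectional fusion (hypothetical helper).

    src, tgt: clips of shape (T, H, W, C). The same spatial box
    (y0, y1, x0, x1) is exchanged in every frame, so the swapped
    region forms a patch sequence that appears in both mixed clips.
    """
    y0, y1, x0, x1 = box
    mask = np.zeros(src.shape[1:3], dtype=bool)
    mask[y0:y1, x0:x1] = True
    m = mask[None, :, :, None]              # broadcast over time and channels
    src_context = np.where(m, tgt, src)     # target patches in source context
    tgt_context = np.where(m, src, tgt)     # source patches in target context
    return src_context, tgt_context, mask

def class_prototypes(feats, labels, num_classes):
    """Average the features of each class over all frames and pixels.

    feats: (T, H, W, D); labels: (T, H, W) integers in [0, num_classes).
    Returns (K, D) prototypes and a (K,) mask of classes that occur.
    A plain mean is an assumed simplification of adaptive aggregation.
    """
    flat_f = feats.reshape(-1, feats.shape[-1])
    onehot = np.eye(num_classes)[labels.reshape(-1)]      # (N, K)
    counts = onehot.sum(axis=0)
    protos = onehot.T @ flat_f / np.maximum(counts, 1)[:, None]
    return protos, counts > 0

def alignment_loss(p_src, p_tgt, present_src, present_tgt, margin=1.0):
    """Pull same-class prototypes together across domains and push
    different-class prototypes at least `margin` apart (assumed form)."""
    dist = np.linalg.norm(p_src[:, None, :] - p_tgt[None, :, :], axis=-1)
    valid = present_src[:, None] & present_tgt[None, :]
    same = np.eye(len(present_src), dtype=bool)
    intra = dist[valid & same]
    inter = np.maximum(0.0, margin - dist[valid & ~same])
    intra_term = intra.mean() if intra.size else 0.0
    inter_term = inter.mean() if inter.size else 0.0
    return intra_term + inter_term
```

In a training loop, the two mixed clips would be fed to the segmentation network alongside the original clips, and the prototype loss would be added to the segmentation loss; a differentiable framework would replace NumPy at that point.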
