Networks trained for domain adaptation are prone to bias toward easy-to-transfer classes. Because ground-truth labels for the target domain are unavailable during training, this bias skews the predictions, and the network forgets to predict hard-to-transfer classes. To address this problem, we propose Cross-domain Moving Object Mixing (CMOM), which cuts several objects, including those of hard-to-transfer classes, from a source domain video clip and pastes them into a target domain video clip. Unlike image-level domain adaptation, mixing moving objects from two different videos requires that the temporal context be maintained. We therefore design CMOM to mix consecutive video frames, so that unrealistic movements do not occur. We additionally propose Feature Alignment with Temporal Context (FATC) to enhance target domain feature discriminability. FATC exploits the robust source domain features, which are trained with ground-truth labels, to learn discriminative target domain features in an unsupervised manner, filtering out unreliable predictions with temporal consensus. We demonstrate the effectiveness of the proposed approaches through extensive experiments. In particular, our model reaches an mIoU of 53.81% on the VIPER to Cityscapes-Seq benchmark and an mIoU of 56.31% on the SYNTHIA-Seq to Cityscapes-Seq benchmark, surpassing the state-of-the-art methods by large margins. The code is available at: https://github.com/kyusik-cho/CMOM.
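The cut-and-paste idea behind CMOM and the consensus filtering behind FATC can be illustrated with a minimal NumPy sketch. This is not the authors' implementation (see the repository above for that). The function names `mix_clip` and `temporal_consensus_mask`, the array shapes, and the use of simple label agreement between two frames as the consensus test are all assumptions made for illustration; in particular, the previous-frame prediction is assumed to be already aligned to the current frame.

```python
import numpy as np

def mix_clip(src_frames, src_labels, tgt_frames, tgt_pseudo, class_ids):
    """Paste source pixels of the selected classes into a target clip.

    Hypothetical shapes: frames (T, H, W, 3), label maps (T, H, W).
    The same class set is applied to every frame of the clip, so a pasted
    object follows its own source-domain motion across consecutive frames.
    """
    # True where a source pixel belongs to one of the selected classes.
    mask = np.isin(src_labels, class_ids)                      # (T, H, W)
    # Copy source pixels where the mask is set, keep target pixels elsewhere.
    mixed_frames = np.where(mask[..., None], src_frames, tgt_frames)
    # Source ground truth for pasted pixels, target pseudo-labels elsewhere.
    mixed_labels = np.where(mask, src_labels, tgt_pseudo)
    return mixed_frames, mixed_labels, mask

def temporal_consensus_mask(pred_cur, pred_prev_aligned):
    """Mark pixels whose predicted class agrees across two frames.

    Assumption: agreement of the two label maps stands in for the
    temporal consensus test; disagreeing pixels are treated as unreliable.
    """
    return pred_cur == pred_prev_aligned
```

Choosing the classes once per clip, instead of once per frame, is what keeps the pasted objects temporally coherent in this sketch.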