Although action recognition has achieved impressive results in recent years, collecting and annotating video training data remains time-consuming and cost-intensive. Image-to-video adaptation has therefore been proposed to exploit web images as a labeling-free source for adapting to unlabeled target videos. This poses two major challenges: (1) the spatial domain shift between web images and video frames; (2) the modality gap between image and video data. To address these challenges, we propose Cycle Domain Adaptation (CycDA), a cycle-based approach for unsupervised image-to-video domain adaptation that, on the one hand, leverages the joint spatial information in images and videos and, on the other hand, trains an independent spatio-temporal model to bridge the modality gap. We alternate between spatial and spatio-temporal learning, with knowledge transfer between the two in each cycle. We evaluate our approach on benchmark datasets for image-to-video as well as mixed-source domain adaptation, achieving state-of-the-art results and demonstrating the benefits of our cyclic adaptation. Code is available at \url{https://github.com/wlin-at/CycDA}.