Open-set Unsupervised Video Domain Adaptation (OUVDA) is the task of adapting an action recognition model from a labelled source domain to an unlabelled target domain that contains "target-private" categories, which are present in the target but absent from the source. In this work we depart from prior approaches, which train a specialized open-set classifier or rely on weighted adversarial learning, and instead propose to use a pre-trained language and vision model (CLIP). CLIP is well suited to OUVDA because of its rich representations and zero-shot recognition capabilities. However, rejecting target-private instances with CLIP's zero-shot protocol requires oracle knowledge of the target-private label names. Because such knowledge is unavailable, we propose AutoLabel, which automatically discovers and generates object-centric compositional candidate target-private class names. Despite its simplicity, we show that CLIP equipped with AutoLabel can satisfactorily reject target-private instances, thereby facilitating better alignment between the shared classes of the two domains. The code is available.
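To make the rejection step concrete, the following is a minimal sketch of a zero-shot protocol extended with candidate target-private names. It is not the AutoLabel implementation: the class names, the toy 4-dimensional embeddings, the temperature value, and the function names `cosine` and `zero_shot_predict` are all assumptions made for illustration. In practice the embeddings would come from CLIP's video (frame) and text encoders, and the candidate names would be discovered automatically rather than written by hand.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def zero_shot_predict(video_emb, text_embs, shared_names, candidate_names,
                      temperature=0.01):
    """Return (label, probability); the label is "unknown" when the best
    match is one of the candidate target-private names.

    text_embs must be ordered as shared_names followed by candidate_names.
    The temperature is an assumed value, not taken from the source.
    """
    names = shared_names + candidate_names
    logits = [cosine(video_emb, t) / temperature for t in text_embs]
    # Numerically stable softmax over all class names.
    top = max(logits)
    exps = [math.exp(l - top) for l in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    best = max(range(len(names)), key=probs.__getitem__)
    if best >= len(shared_names):
        return "unknown", probs[best]
    return names[best], probs[best]


# Hypothetical label sets: shared action classes and candidate
# target-private names (object-centric compositions).
shared_names = ["climb", "run"]
candidate_names = ["horse person", "ball racket"]

# Toy stand-ins for text-encoder outputs, one per name, in order.
text_embs = [
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.0],
    [0.0, 0.0, 0.0, 1.0],
]

# Toy stand-ins for two target video embeddings.
video_shared = [0.9, 0.1, 0.0, 0.1]   # closest to "climb"
video_private = [0.1, 0.0, 0.9, 0.2]  # closest to "horse person"

label_shared, _ = zero_shot_predict(
    video_shared, text_embs, shared_names, candidate_names)
label_private, _ = zero_shot_predict(
    video_private, text_embs, shared_names, candidate_names)
```

Under these assumptions, a target video whose best match is a candidate name is mapped to a single "unknown" class and can be excluded when aligning the shared classes, which is the role the abstract attributes to rejection.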