We introduce LaGTran, a novel framework that utilizes readily available or easily acquired text descriptions to guide robust transfer of discriminative knowledge from labeled source to unlabeled target data under domain shifts. While unsupervised adaptation methods have been developed to address this problem, they show limitations in handling challenging domain shifts because they operate exclusively in pixel space. Motivated by our observation that the semantically richer text modality has more favorable transfer properties, we devise a transfer mechanism that uses a source-trained text classifier to generate predictions on the target text descriptions and utilizes these predictions as supervision for the corresponding images. Our language-guided approach is surprisingly simple, yet it significantly outperforms all prior approaches on challenging datasets such as GeoNet and DomainNet, demonstrating its effectiveness. To extend the scope of our study beyond images, we introduce a new benchmark for ego-exo transfer in videos and find that our language-aided LaGTran yields significant gains in this highly challenging transfer setting. Code, models, and the proposed datasets are publicly available at https://tarun005.github.io/lagtran/.
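To make the transfer mechanism concrete, below is a minimal sketch of the core pseudo-labeling step: train a text classifier on labeled source descriptions, predict labels for target descriptions, and use those predictions as supervision for the corresponding target images. The TF-IDF + logistic-regression classifier and all variable names here are illustrative assumptions, standing in for whatever text classifier the actual system employs.

```python
# Minimal sketch of language-guided pseudo-labeling, assuming simple
# stand-in components (TF-IDF + logistic regression) for the text classifier.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression


def language_guided_pseudo_labels(source_texts, source_labels, target_texts):
    """Train a text classifier on labeled source descriptions and use it
    to pseudo-label the unlabeled target descriptions.

    Each target description corresponds to one target image, so the
    returned labels can directly supervise an image classifier on the
    target domain.
    """
    # Fit a text classifier on the labeled source-domain descriptions.
    vectorizer = TfidfVectorizer()
    x_src = vectorizer.fit_transform(source_texts)
    text_clf = LogisticRegression(max_iter=1000).fit(x_src, source_labels)

    # Predict labels for the target-domain descriptions; these serve as
    # pseudo-labels for the corresponding target images.
    x_tgt = vectorizer.transform(target_texts)
    return text_clf.predict(x_tgt)


# Hypothetical usage: pair the pseudo-labels with target images and train
# a standard image classifier with cross-entropy on (image, pseudo-label).
pseudo_labels = language_guided_pseudo_labels(
    source_texts=["a red barn in a field", "a city bus at a stop"],
    source_labels=["barn", "bus"],
    target_texts=["an old wooden barn", "a double-decker bus"],
)
```

The key design point this sketch illustrates is that all cross-domain prediction happens in the text modality, which the paper observes transfers more robustly than pixels; the image model only ever sees within-domain supervision.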