Source-Free Video Unsupervised Domain Adaptation (SFVUDA) is the task of adapting an action recognition model, trained on a labelled source dataset, to an unlabelled target dataset without access to the actual source data. Previous approaches have addressed SFVUDA by leveraging self-supervision (e.g., enforcing temporal consistency) derived from the target data itself. In this work, we take an orthogonal approach by exploiting "web-supervision" from Large Language-Vision Models (LLVMs), driven by the rationale that LLVMs contain a rich world prior that is surprisingly robust to domain shift. We showcase the unreasonable effectiveness of integrating LLVMs for SFVUDA by devising an intuitive and parameter-efficient method, which we name Domain Adaptation with Large Language-Vision models (DALL-V), that distills the world prior and complementary source model information into a student network tailored for the target. Despite its simplicity, DALL-V achieves significant improvement over state-of-the-art SFVUDA methods.