VideoAdviser: Video Knowledge Distillation for Multimodal Transfer Learning

Multimodal transfer learning aims to transform pretrained representations of diverse modalities into a common domain space for effective multimodal fusion. However, conventional systems are typically built on the assumption that all modalities are available, and a missing modality typically degrades inference performance. Furthermore, extracting pretrained embeddings for all modalities is computationally expensive at inference time. In this work, to achieve multimodal transfer learning that is both efficient and high-performing, we propose VideoAdviser, a video knowledge distillation method that transfers multimodal knowledge of video-enhanced prompts from a multimodal fundamental model (teacher) to a specific modal fundamental model (student). Guided by the intuition that the best learning comes from professional advisers and smart students, we use a CLIP-based teacher model to provide expressive multimodal knowledge supervision signals to a RoBERTa-based student model by optimizing a step-distillation objective loss. In the first step, the teacher distills multimodal knowledge of video-enhanced prompts from classification logits to a regression logit; in the second step, the multimodal knowledge is distilled from the regression logit of the teacher to the student. We evaluate our method on two challenging multimodal tasks: video-level sentiment analysis (MOSI and MOSEI datasets) and audio-visual retrieval (VEGAS dataset). The student, which requires only the text modality as input, achieves an MAE score improvement of up to 12.3% on MOSI and MOSEI. Our method further improves on the state-of-the-art method by 3.4% mAP on VEGAS without additional computation at inference. These results suggest the strengths of our method for achieving efficient, high-performing multimodal transfer learning.
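The two-step objective can be illustrated with a minimal sketch. The abstract does not give the exact loss terms, so everything below is an assumption made for illustration: step 1 is modeled as a squared error between the teacher's regression logit and the expected score under its softmaxed classification logits, step 2 as a squared error between the teacher's and the student's regression logits, and the names `bin_values`, `w1`, and `w2` are hypothetical. The sketch operates on plain numbers and does not involve CLIP or RoBERTa.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def step1_loss(cls_logits, teacher_reg, bin_values):
    """Step 1 (assumed form): classification logits -> regression logit.

    The teacher's class distribution is collapsed to an expected score
    over hypothetical per-class values, and the teacher's regression
    logit is pulled toward that score with a squared error.
    """
    probs = softmax(cls_logits)
    expected = sum(p * v for p, v in zip(probs, bin_values))
    return (teacher_reg - expected) ** 2


def step2_loss(teacher_reg, student_reg):
    """Step 2 (assumed form): teacher regression logit -> student."""
    return (student_reg - teacher_reg) ** 2


def step_distillation_loss(cls_logits, teacher_reg, student_reg,
                           bin_values, w1=1.0, w2=1.0):
    """Weighted sum of both steps; the weights w1 and w2 are hypothetical."""
    return (w1 * step1_loss(cls_logits, teacher_reg, bin_values)
            + w2 * step2_loss(teacher_reg, student_reg))


if __name__ == "__main__":
    # Illustrative numbers only: three classes mapped to scores -1, 0, 1.
    loss = step_distillation_loss(
        cls_logits=[0.2, 0.1, 1.5],
        teacher_reg=0.6,
        student_reg=0.3,
        bin_values=[-1.0, 0.0, 1.0],
    )
    print(f"step-distillation loss: {loss:.4f}")
```

In this sketch only the second term involves the student, which mirrors the abstract's claim that the student needs the text modality alone at inference: the teacher's video-enhanced signal enters through the training loss, not through the student's inputs.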
