Multilingual text-video retrieval methods have improved significantly in recent years, but performance in other languages still lags behind English. We propose a Cross-Lingual Cross-Modal Knowledge Distillation method to improve multilingual text-video retrieval. Motivated by the observation that English text-video retrieval outperforms retrieval in other languages, we train a student model that takes input text in different languages to match the cross-modal predictions of teacher models that take input text in English. We propose a cross-entropy-based objective that forces the distribution over the student's text-video similarity scores to be similar to that of the teacher models. We introduce a new multilingual video dataset, Multi-YouCook2, by translating the English captions in the YouCook2 video dataset into 8 other languages. Our method improves multilingual text-video retrieval performance on Multi-YouCook2 and on several other datasets, such as Multi-MSRVTT and VATEX. We also analyze the effectiveness of different multilingual text models as teachers. The code, models, and dataset are available at https://github.com/roudimit/c2kd.
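The cross-entropy-based distillation objective described in the abstract can be illustrated with a minimal NumPy sketch. This is not the released implementation (see the repository linked above for that); it only shows the general idea of matching the student's distribution over text-video similarity scores to the teachers'. The function names, the `temperature` parameter, the row-wise softmax over the videos in a batch, and the averaging of teacher similarity matrices are all assumptions made for illustration.

```python
import numpy as np


def log_softmax(x, axis=-1):
    """Numerically stable log-softmax along the given axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))


def distillation_loss(student_sim, teacher_sims, temperature=1.0):
    """Cross-entropy between teacher and student similarity distributions.

    student_sim:  (B, B) text-video similarity scores computed from
                  non-English input text (rows: texts, columns: videos).
    teacher_sims: list of (B, B) similarity matrices computed by teacher
                  models from the English versions of the same texts.
    temperature:  hypothetical softening parameter (an assumption here).
    """
    # Assumption: combine several teachers by averaging their scores.
    teacher_avg = np.mean(teacher_sims, axis=0)

    # Turn each row of scores into a distribution over the videos in the batch.
    target = np.exp(log_softmax(teacher_avg / temperature, axis=1))
    log_pred = log_softmax(student_sim / temperature, axis=1)

    # Cross-entropy per text query, averaged over the batch.
    return float(-(target * log_pred).sum(axis=1).mean())


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    teacher_a = rng.normal(size=(4, 4))  # stand-in for English teacher scores
    teacher_b = rng.normal(size=(4, 4))
    student = rng.normal(size=(4, 4))    # stand-in for non-English student scores
    print(distillation_loss(student, [teacher_a, teacher_b]))
```

For a fixed teacher distribution, this loss is smallest when the student's softmax distribution equals the teacher's, which is the sense in which the objective pushes the student's similarity scores for non-English text toward the teachers' scores for English text.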