This work investigates the use of large-scale, English-only pre-trained models (CLIP and HuBERT) for multilingual image-speech retrieval. For non-English image-speech retrieval, we outperform the current state of the art by a wide margin, both when training a separate model for each language and when training a single model that processes speech in all three languages. We identify key differences in model behavior and performance between English and non-English settings, attributable to the English-only pre-training of CLIP and HuBERT, and investigate how fine-tuning the pre-trained models affects these differences. Finally, we show that our models can be used for mono- and cross-lingual speech-text retrieval and cross-lingual speech-speech retrieval, despite never having seen any parallel speech-text or speech-speech data during training.