This study introduces CLASP (Contrastive Language-Speech Pretraining), a multilingual, multimodal representation model tailored for audio-text information retrieval. CLASP leverages the synergy between spoken content and textual data. During training, we utilize our newly introduced speech-text dataset, which spans 15 diverse categories ranging from fiction to religion. CLASP's audio component integrates audio spectrograms with a pre-trained self-supervised speech model, while its language encoding counterpart employs a sentence encoder pre-trained on over 100 languages. This unified lightweight model bridges the gap between modalities and languages, enhancing its effectiveness in handling and retrieving multilingual and multimodal data. Our evaluations across multiple languages demonstrate that CLASP establishes new benchmarks in HITS@1, MRR, and mean rank (meanR) metrics, outperforming traditional ASR-based retrieval approaches in specific scenarios.
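To make the pretraining objective concrete, below is a minimal sketch of a CLIP-style symmetric contrastive loss over paired audio and text embeddings, which is one common form such language-speech pretraining takes. The function name, the temperature value, and the exact loss formulation here are illustrative assumptions, not the paper's verbatim implementation.

```python
import numpy as np

def contrastive_loss(audio_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of paired embeddings.

    audio_emb, text_emb: (batch, dim) arrays where row i of each
    array forms a matched audio-text pair. Temperature 0.07 is an
    assumed hyperparameter, not taken from the paper.
    """
    # L2-normalize so the dot product is cosine similarity.
    a = audio_emb / np.linalg.norm(audio_emb, axis=1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    logits = a @ t.T / temperature  # (batch, batch) similarity matrix

    def cross_entropy_diag(l):
        # Numerically stable log-softmax; targets are the diagonal
        # (each audio matches the text with the same index).
        l = l - l.max(axis=1, keepdims=True)
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_probs))

    # Average the audio->text and text->audio retrieval directions.
    return (cross_entropy_diag(logits) + cross_entropy_diag(logits.T)) / 2
```

Minimizing this loss pulls matched audio-text pairs together and pushes mismatched pairs apart in the shared embedding space, which is what enables cross-modal retrieval ranked by cosine similarity at test time.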