In this paper, we introduce a massively multilingual speech corpus with fine-grained phonemic transcriptions, encompassing more than 115 languages from diverse language families. Based on this multilingual dataset, we propose CLAP-IPA, a multilingual phoneme-speech contrastive embedding model capable of open-vocabulary matching between speech signals and phonemically transcribed keywords or arbitrary phrases. We evaluated the proposed model on two fieldwork speech corpora covering 97 unseen languages, where it exhibits strong generalizability across languages. A comparison with a text-based model shows that using phonemes as modeling units enables much better crosslinguistic generalization than using orthographic text.
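To make the idea of phoneme-speech contrastive matching concrete, the following is a minimal sketch, not the CLAP-IPA implementation. It assumes that speech segments and IPA strings have already been mapped to fixed-size vectors by two encoders (not shown), and that training uses a symmetric, CLIP-style InfoNCE objective; the abstract does not specify the loss, so this is an assumption. The function names, the temperature value, and the toy vectors are all hypothetical.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so that dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def cosine_scores(speech_vec, phoneme_vecs):
    """Score one speech embedding against every phonemically transcribed keyword."""
    s = l2_normalize(speech_vec)
    return {
        key: sum(a * b for a, b in zip(s, l2_normalize(vec)))
        for key, vec in phoneme_vecs.items()
    }


def symmetric_contrastive_loss(speech_batch, phoneme_batch, temperature=0.07):
    """Assumed CLIP-style objective: pair i in one batch matches pair i in the other."""
    speech = [l2_normalize(v) for v in speech_batch]
    phones = [l2_normalize(v) for v in phoneme_batch]
    logits = [
        [sum(a * b for a, b in zip(s, p)) / temperature for p in phones]
        for s in speech
    ]

    def mean_cross_entropy(rows):
        # The correct class for row i is column i (the matched pair).
        total = 0.0
        for i, row in enumerate(rows):
            top = max(row)
            log_z = top + math.log(sum(math.exp(x - top) for x in row))
            total += log_z - row[i]
        return total / len(rows)

    transposed = [list(col) for col in zip(*logits)]
    # Average the speech-to-phoneme and phoneme-to-speech directions.
    return 0.5 * (mean_cross_entropy(logits) + mean_cross_entropy(transposed))


# Toy, hand-made embeddings (hypothetical; real ones would come from the encoders).
phoneme_vecs = {
    "kæt": [0.9, 0.1, 0.0],
    "dɔɡ": [0.1, 0.9, 0.1],
    "haʊs": [0.0, 0.2, 0.9],
}
speech_vec = [0.8, 0.2, 0.1]

scores = cosine_scores(speech_vec, phoneme_vecs)
best_match = max(scores, key=scores.get)
```

Because matching reduces to a similarity ranking in a shared embedding space, the keyword set can change at query time without retraining, which is what makes the matching open-vocabulary in this sketch.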