Numerous examples in the literature have shown that deep learning models can work well with multimodal data. Recently, CLIP has enabled deep learning systems to learn shared latent spaces between images and text descriptions, with outstanding zero- or few-shot results in downstream tasks. In this paper we explore the idea proposed by CLIP applied to the speech domain, where the phonetic and acoustic spaces usually coexist. We train a CLIP-based model to learn shared representations of the phonetic and acoustic spaces. The results show that the proposed model is sensitive to phonetic changes, with a 91% score drop when 20% of the phonemes are replaced at random, while remaining substantially robust to different kinds of noise, with a 10% performance drop when the audio is mixed with 75% Gaussian noise. We also provide empirical evidence that the resulting embeddings are useful for a variety of downstream applications, such as intelligibility evaluation and leveraging rich pre-trained phonetic embeddings in speech generation tasks. Finally, we discuss potential applications with interesting implications for the speech generation and recognition fields.
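The abstract does not spell out the training objective, so the following is only a minimal sketch of the symmetric contrastive loss commonly associated with CLIP, here applied to paired phonetic and acoustic embeddings. The function names (`l2_normalize`, `log_softmax`, `clip_style_loss`), the NumPy implementation, and the fixed `temperature` value are illustrative assumptions, not the authors' implementation.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length so dot products become cosine similarities."""
    return x / np.maximum(np.linalg.norm(x, axis=-1, keepdims=True), eps)


def log_softmax(logits, axis):
    """Numerically stable log-softmax along the given axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))


def clip_style_loss(phonetic_emb, acoustic_emb, temperature=0.07):
    """Symmetric contrastive loss over a batch of paired embeddings.

    Row i of `phonetic_emb` and row i of `acoustic_emb` are assumed to come
    from the same utterance; all other rows in the batch act as negatives.
    `temperature` is an illustrative fixed value (an assumption).
    """
    p = l2_normalize(phonetic_emb)
    a = l2_normalize(acoustic_emb)
    logits = p @ a.T / temperature  # shape: (batch, batch)
    idx = np.arange(logits.shape[0])
    # Phonetic -> acoustic: each row should select its own column.
    loss_p2a = -log_softmax(logits, axis=1)[idx, idx].mean()
    # Acoustic -> phonetic: each column should select its own row.
    loss_a2p = -log_softmax(logits, axis=0)[idx, idx].mean()
    return 0.5 * (loss_p2a + loss_a2p)
```

Under this sketch, correctly paired embeddings yield a low loss, while misaligned pairs (for example, acoustic embeddings shifted by one position in the batch) yield a high one. This is the kind of behaviour that would make a matching score sensitive to phonetic changes.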