Numerous examples in the literature have shown that deep learning models can work well with multimodal data. Recently, CLIP has enabled deep learning systems to learn shared latent spaces between images and text descriptions, with outstanding zero- or few-shot results in downstream tasks. In this paper we explore the same idea proposed by CLIP, applied to the speech domain, where the phonetic and acoustic spaces usually coexist. We train a CLIP-based model to learn shared representations of the phonetic and acoustic spaces. The results show that the proposed model is sensitive to phonetic changes, with a 91% score drop when 20% of the phonemes are replaced at random, while providing substantial robustness against different kinds of noise, with a 10% performance drop when the audio is mixed with 75% Gaussian noise. We also provide empirical evidence that the resulting embeddings are useful for a variety of downstream applications, such as intelligibility evaluation and leveraging rich pre-trained phonetic embeddings in speech generation tasks. Finally, we discuss potential applications with interesting implications for the speech generation and recognition fields.
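To make the idea of a shared phonetic-acoustic space concrete, the following is a minimal sketch of the symmetric contrastive objective commonly associated with CLIP, applied here to paired phoneme-sequence and audio embeddings. It is an illustration under stated assumptions, not the training code of the proposed model: the function name `clip_style_loss`, the toy two-dimensional embeddings, and the fixed temperature of 0.07 are all hypothetical choices, and the encoders that would produce the embeddings are omitted.

```python
import math


def normalize(v):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def log_softmax(row):
    """Numerically stable log-softmax over a list of logits."""
    m = max(row)
    log_z = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - log_z for x in row]


def clip_style_loss(phoneme_emb, audio_emb, temperature=0.07):
    """Symmetric contrastive loss over a batch of paired embeddings.

    Row i of phoneme_emb and row i of audio_emb are assumed to describe
    the same utterance; every other pairing in the batch is a negative.
    The temperature value is an assumption made for this sketch.
    """
    p = [normalize(v) for v in phoneme_emb]
    a = [normalize(v) for v in audio_emb]
    n = len(p)
    # Pairwise cosine similarities, scaled by the temperature.
    logits = [
        [sum(x * y for x, y in zip(p[i], a[j])) / temperature for j in range(n)]
        for i in range(n)
    ]
    # Phoneme-to-audio direction: each row should select its own column.
    loss_p2a = -sum(log_softmax(logits[i])[i] for i in range(n)) / n
    # Audio-to-phoneme direction: each column should select its own row.
    columns = [list(col) for col in zip(*logits)]
    loss_a2p = -sum(log_softmax(columns[j])[j] for j in range(n)) / n
    return 0.5 * (loss_p2a + loss_a2p)


# Toy batch of two utterances (hypothetical values).
phonemes = [[1.0, 0.0], [0.0, 1.0]]
audio = [[0.9, 0.1], [0.2, 0.8]]

print(clip_style_loss(phonemes, audio))        # correctly paired batch
print(clip_style_loss(phonemes, audio[::-1]))  # deliberately mismatched batch
```

The correctly paired batch yields a lower loss than the mismatched one. A sensitivity test of the kind described above would rely on the same behavior: a matching score between an utterance and its phoneme sequence should fall when part of that sequence is replaced.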