We propose FEIM-TTS, a zero-shot text-to-speech (TTS) model that synthesizes emotionally expressive speech aligned with facial images and modulated by emotion intensity. Leveraging deep learning, FEIM-TTS goes beyond traditional TTS systems by interpreting facial cues and adjusting to emotional nuances without relying on labeled emotion datasets. To address the sparsity of audio-visual-emotional data, the model is trained on the LRS3, CREMA-D, and MELD datasets, demonstrating its adaptability. FEIM-TTS's ability to produce high-quality, speaker-agnostic speech makes it well suited to creating adaptable voices for virtual characters. Moreover, FEIM-TTS significantly enhances accessibility for individuals with visual impairments. By integrating emotional nuances into TTS, our model enables dynamic and engaging auditory experiences for webcomics, allowing visually impaired users to enjoy these narratives more fully. Comprehensive evaluation demonstrates its proficiency in modulating emotion and intensity, advancing both emotional speech synthesis and accessibility. Samples are available at: https://feim-tts.github.io/.