With rapid globalization, the need to build inclusive and representative speech technology cannot be overstated. Accent is an important aspect of speech that must be taken into account when building inclusive speech synthesizers. Inclusive speech technology aims to eliminate biases against specific groups, such as speakers with particular accents. We note that state-of-the-art Text-to-Speech (TTS) systems may not currently be suitable for all users, regardless of background, as they are designed to generate high-quality voices without explicitly modeling accent. In this paper, we propose a TTS model that uses a Multi-Level Variational Autoencoder with adversarial learning to address accented speech synthesis and conversion in TTS, with a vision toward more inclusive systems in the future. We evaluate performance through both objective metrics and subjective listening tests. The results show an improvement in accent conversion ability over the baseline.