Modern text-to-speech (TTS) systems can produce highly realistic, natural-sounding speech. Despite these advances, customizing TTS voices remains a complex task that mostly requires specialist expertise. One reason is the reliance on deep learning models, whose large, non-interpretable parameter spaces make manual customization impractical. In this paper, we present a novel human-in-the-loop paradigm based on an evolutionary algorithm for directly interacting with the parameter space of a neural TTS model. We integrate our approach into a user-friendly graphical user interface that allows users to efficiently create original voices. These voices can then be used with the backbone TTS model, for which we provide a Python API. Further, we present the results of a user study exploring the capabilities of the resulting system, VoiceX. We show that VoiceX is an appropriate tool for creating individual, custom voices.
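As a rough illustration of the human-in-the-loop evolutionary idea, the following minimal sketch shows a (1+λ)-style loop in which a listener repeatedly picks a favourite candidate and the next generation is bred from that choice. It is not the VoiceX implementation or its Python API: the names `evolve_voice`, `mutate`, and `make_simulated_user`, the plain float vector standing in for the TTS model's voice parameters, and all hyperparameters are assumptions made purely for illustration. A simulated listener replaces the human so that the example runs on its own.

```python
import random
from typing import Callable, List, Sequence

Vector = List[float]


def mutate(parent: Sequence[float], sigma: float, rng: random.Random) -> Vector:
    """Return a copy of `parent` with Gaussian noise added to every dimension."""
    return [x + rng.gauss(0.0, sigma) for x in parent]


def evolve_voice(
    dim: int,
    choose: Callable[[List[Vector]], int],
    generations: int = 10,
    population_size: int = 6,
    sigma: float = 0.3,
    seed: int = 0,
) -> Vector:
    """Interactive evolutionary search over a hypothetical voice parameter vector.

    `choose` receives the current candidates and returns the index of the one
    the listener prefers. In an interactive tool, each candidate would be
    synthesized and played back before the choice is made (assumption).
    """
    rng = random.Random(seed)
    # Start from random candidates in the (assumed) parameter space.
    population = [[rng.gauss(0.0, 1.0) for _ in range(dim)]
                  for _ in range(population_size)]
    best = population[0]
    for _ in range(generations):
        best = population[choose(population)]
        # Keep the favourite unchanged (elitism) and fill the rest of the
        # next generation with mutated copies of it.
        population = [list(best)] + [
            mutate(best, sigma, rng) for _ in range(population_size - 1)
        ]
    return best


def make_simulated_user(target: Sequence[float]) -> Callable[[List[Vector]], int]:
    """Stand-in for a human listener: prefers the candidate closest to `target`."""
    def choose(candidates: List[Vector]) -> int:
        distances = [sum((x - t) ** 2 for x, t in zip(c, target))
                     for c in candidates]
        return distances.index(min(distances))
    return choose


# Usage: the simulated listener steers the search toward a hidden target voice.
voice = evolve_voice(dim=4, choose=make_simulated_user([0.5, -1.0, 2.0, 0.0]))
print(voice)
```

Passing the selection step in as a callback keeps the search loop independent of how candidates are presented, so the same loop could in principle be driven by a graphical interface rather than a simulated listener.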