Neural text-to-speech (TTS) synthesis uses neural networks to generate speech from text. One of its most notable capabilities is producing speech in the voices of different speakers. This paper introduces voice cloning and speech synthesis (https://pypi.org/project/voice-cloning/), an open-source Python package intended to help people with speech disorders communicate more effectively and to support professionals who want to integrate voice cloning or speech synthesis capabilities into their projects. The package aims to generate synthetic speech that sounds like an individual's natural voice; it is not intended to replace the natural human voice. The system comprises four components: a speaker verification system, a synthesizer, a vocoder, and a noise reduction stage. The speaker verification system is trained on a varied set of speakers, without relying on transcriptions, to achieve optimal generalization performance. The synthesizer is trained on both audio and transcriptions and generates a Mel spectrogram from text; the vocoder converts the generated Mel spectrogram into the corresponding audio signal. The audio signal is then processed by a noise reduction algorithm to eliminate unwanted noise and enhance speech clarity. The performance of synthesized speech for seen and unseen speakers is evaluated using subjective and objective measures such as Mean Opinion Score (MOS), Gross Pitch Error (GPE), and Spectral Distortion (SD). The model can also create speech in distinct voices by using speaker characteristics that are chosen randomly.
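As an illustration of the objective measures named above, the following is a minimal sketch of how GPE and SD might be computed with NumPy. It is not taken from the voice-cloning package. The function names, the 20% pitch tolerance, the convention that unvoiced frames are marked with an F0 of 0, and the use of a log-spectral distance in dB for SD are all assumptions made for this example; the paper may define these metrics differently.

```python
import numpy as np


def gross_pitch_error(f0_ref, f0_syn, tol=0.2):
    """Fraction of frames voiced in BOTH tracks whose F0 deviates from the
    reference by more than `tol` (relative). Assumes unvoiced frames have F0 == 0
    and that both tracks are already time-aligned frame by frame."""
    f0_ref = np.asarray(f0_ref, dtype=float)
    f0_syn = np.asarray(f0_syn, dtype=float)
    if f0_ref.shape != f0_syn.shape:
        raise ValueError("F0 tracks must have the same number of frames")
    voiced = (f0_ref > 0) & (f0_syn > 0)
    if not voiced.any():
        return 0.0  # no commonly voiced frames, so no pitch errors to count
    rel_err = np.abs(f0_syn[voiced] - f0_ref[voiced]) / f0_ref[voiced]
    return float(np.mean(rel_err > tol))


def log_spectral_distortion(power_ref, power_syn, eps=1e-10):
    """Mean over frames of the RMS difference (in dB) between two power
    spectrograms of shape (frames, bins). `eps` guards against log(0)."""
    power_ref = np.asarray(power_ref, dtype=float)
    power_syn = np.asarray(power_syn, dtype=float)
    if power_ref.shape != power_syn.shape:
        raise ValueError("Spectrograms must have the same shape")
    diff_db = 10.0 * np.log10((power_ref + eps) / (power_syn + eps))
    per_frame = np.sqrt(np.mean(diff_db ** 2, axis=-1))
    return float(np.mean(per_frame))


# Toy example: two frames are voiced in both tracks, one of them is off by 50%.
print(gross_pitch_error([100, 100, 0, 200], [110, 150, 120, 0]))  # 0.5
```

Both functions expect frame-aligned inputs, so in practice the reference and synthesized utterances would first need to be aligned (for example by trimming or time warping), a step this sketch leaves out. MOS is omitted because it is an average of listener ratings rather than a signal-level computation.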