Empowering Communication: Speech Technology for Indian and Western Accents through AI-powered Speech Synthesis

Neural text-to-speech (TTS) synthesis generates speech using neural networks. One of its most remarkable features is the ability to produce speech in the voices of different speakers. This paper introduces voice cloning and speech synthesis (https://pypi.org/project/voice-cloning/), an open-source Python package intended to help people with speech disorders communicate more effectively, and to serve professionals seeking to integrate voice cloning or speech synthesis capabilities into their projects. The package aims to generate synthetic speech that sounds like an individual's natural voice, but it does not replace the natural human voice. The system architecture comprises a speaker verification system, a synthesizer, a vocoder, and a noise reduction stage. The speaker verification system is trained on a varied set of speakers, without relying on transcriptions, to achieve optimal generalization performance. The synthesizer is trained on both audio and transcriptions and generates a Mel spectrogram from text; the vocoder converts the generated Mel spectrogram into the corresponding audio signal. The audio signal is then processed by a noise reduction algorithm to eliminate unwanted noise and enhance speech clarity. The synthesized speech for seen and unseen speakers is evaluated using subjective and objective measures such as Mean Opinion Score (MOS), Gross Pitch Error (GPE), and Spectral Distortion (SD). The model can also create speech in distinct voices by using randomly chosen speaker characteristics.
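The evaluation measures named in the abstract can be illustrated with a minimal NumPy sketch. The abstract does not give the exact formulas, so the definitions below are assumptions based on common usage: GPE is taken as the fraction of frames voiced in both signals whose F0 deviates from the reference by more than 20%, SD as the frame-averaged RMS difference between log power spectra in dB, and MOS as the plain mean of listener ratings. The function names, the 20% tolerance, and the `eps` floor are illustrative and are not taken from the voice-cloning package.

```python
import numpy as np


def gross_pitch_error(f0_ref, f0_syn, tol=0.2):
    """Fraction of frames voiced in both tracks whose F0 is off by more than tol.

    Unvoiced frames are assumed to be marked with F0 = 0 (an assumption of
    this sketch, not a convention stated in the abstract).
    """
    f0_ref = np.asarray(f0_ref, dtype=float)
    f0_syn = np.asarray(f0_syn, dtype=float)
    both_voiced = (f0_ref > 0) & (f0_syn > 0)
    if not both_voiced.any():
        return 0.0
    rel_err = np.abs(f0_syn[both_voiced] - f0_ref[both_voiced]) / f0_ref[both_voiced]
    return float(np.mean(rel_err > tol))


def log_spectral_distortion(power_ref, power_syn, eps=1e-10):
    """Frame-averaged RMS difference (dB) between two power spectrograms.

    Both inputs are arrays of shape (frames, frequency_bins); eps avoids log(0).
    """
    log_ref = 10.0 * np.log10(np.asarray(power_ref, dtype=float) + eps)
    log_syn = 10.0 * np.log10(np.asarray(power_syn, dtype=float) + eps)
    per_frame = np.sqrt(np.mean((log_ref - log_syn) ** 2, axis=1))
    return float(np.mean(per_frame))


def mean_opinion_score(ratings):
    """Arithmetic mean of listener ratings (typically on a 1 to 5 scale)."""
    return float(np.mean(np.asarray(ratings, dtype=float)))
```

In practice the F0 tracks and power spectrograms would come from a pitch tracker and a short-time Fourier transform applied to time-aligned reference and synthesized recordings; that alignment step is outside this sketch.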



