Numerous models have achieved great success in speech recognition and in speech synthesis, but models for speech-to-speech processing remain largely unexplored. We propose the Speech to Speech Synthesis Network (STSSN), a model based on current state-of-the-art systems that fuses the two disciplines to perform effective speech-to-speech style transfer for the purpose of voice impersonation. We show that the proposed model is powerful and succeeds in generating realistic audio samples despite a number of limitations in its capacity. We benchmark the model against a generative adversarial model that accomplishes a similar task and show that ours produces more convincing results.