Singing technique conversion (STC) is the task of converting a singing voice from one vocal technique to another while leaving the singer identity, melody, and linguistic content intact. Previous STC studies, and singing voice conversion research in general, have used convolutional autoencoders (CAEs) for conversion, but how the bottleneck width of the CAE affects synthesis quality has not been thoroughly evaluated. To this end, we constructed a GAN-based multi-domain STC system that uses the WORLD vocoder representation and the CAE architecture. We varied the bottleneck width of the CAE and evaluated the conversion results subjectively. The model was trained on a Mandarin dataset featuring four singers and four singing techniques: chest voice, falsetto, raspy voice, and whistle voice. The results show that a wider bottleneck corresponds to better articulation clarity but does not necessarily yield higher likeness to the target technique. We also found that, among the four techniques, whistle voice is the easiest target for conversion, whereas chest voice, falsetto, and raspy voice each produce more convincing results as a source than whistle voice does.