Although Singing Voice Synthesis (SVS) has made great strides by adopting Text-to-Speech (TTS) techniques, multilingual singing voice modeling remains relatively unexplored. This paper presents BiSinger, a bilingual SVS system for English and Mandarin Chinese. Current systems require a separate model for each language and cannot accurately represent both Chinese and English, which hinders code-switch SVS. To address this gap, we design a shared representation for Chinese and English singing voices, built on the CMU dictionary with mapping rules. We fuse monolingual singing datasets using established singing voice conversion techniques to generate bilingual singing voices, and we also explore the potential use of bilingual speech data. Experiments show that our language-independent representation and the incorporation of related datasets enable a single model to achieve improved performance in English and code-switch SVS while maintaining its performance on Chinese songs. Audio samples are available at https://bisinger-svs.github.io.
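
As a rough illustration of what a shared representation built on the CMU dictionary with mapping rules could look like, the sketch below converts English words and Mandarin pinyin syllables into one ARPAbet-style phoneme inventory. It is a minimal sketch under stated assumptions, not the system's actual rule set: the lexicon excerpt, the pinyin mapping tables, and all function names (`english_to_phones`, `pinyin_to_phones`, `to_shared_phones`) are hypothetical and cover only a handful of entries.

```python
import re

# Tiny excerpt in CMU dictionary style: word -> ARPAbet phones with stress digits.
# A real system would load the full dictionary instead.
CMU_EXCERPT = {
    "hello": ["HH", "AH0", "L", "OW1"],
    "world": ["W", "ER1", "L", "D"],
}

# Hypothetical pinyin -> ARPAbet mapping rules, for illustration only.
# They are NOT the mapping rules used by BiSinger.
PINYIN_INITIALS = {"n": ["N"], "h": ["HH"], "m": ["M"], "sh": ["SH"]}
PINYIN_FINALS = {"i": ["IY"], "ao": ["AW"], "a": ["AA"], "ou": ["OW"]}


def strip_digits(symbol: str) -> str:
    """Remove stress digits (ARPAbet) or tone digits (pinyin)."""
    return re.sub(r"\d", "", symbol)


def english_to_phones(word: str) -> list[str]:
    """Look up an English word and drop stress markers."""
    return [strip_digits(p) for p in CMU_EXCERPT[word.lower()]]


def pinyin_to_phones(syllable: str) -> list[str]:
    """Split a pinyin syllable into initial + final and map both parts."""
    base = strip_digits(syllable.lower())
    # Try longer initials first so that "sh" wins over a one-letter initial.
    for initial in sorted(PINYIN_INITIALS, key=len, reverse=True):
        final = base[len(initial):]
        if base.startswith(initial) and final in PINYIN_FINALS:
            return PINYIN_INITIALS[initial] + PINYIN_FINALS[final]
    if base in PINYIN_FINALS:  # syllable without an initial
        return list(PINYIN_FINALS[base])
    raise KeyError(f"no mapping rule for pinyin syllable: {syllable}")


def to_shared_phones(tokens: list[tuple[str, str]]) -> list[str]:
    """Convert (language, text) tokens into one shared phoneme sequence."""
    phones: list[str] = []
    for lang, text in tokens:
        if lang == "en":
            phones.extend(english_to_phones(text))
        elif lang == "zh":
            phones.extend(pinyin_to_phones(text))
        else:
            raise ValueError(f"unsupported language tag: {lang}")
    return phones


if __name__ == "__main__":
    # Code-switch lyric: two pinyin syllables followed by an English word.
    print(to_shared_phones([("zh", "ni3"), ("zh", "hao3"), ("en", "world")]))
```

Because both languages end up in the same symbol set, a single acoustic model can in principle consume code-switch lyrics without a per-language front end; how tones, stress, and phones missing from the inventory are handled is left open in this sketch.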