Although Singing Voice Synthesis (SVS) has advanced considerably by adopting Text-to-Speech (TTS) techniques, multilingual singing voice modeling remains relatively unexplored. This paper presents BiSinger, a bilingual pop SVS system for English and Mandarin Chinese. Current systems require a separate model per language and cannot accurately represent both Chinese and English, which hinders code-switching SVS. To address this gap, we design a shared representation for Chinese and English singing voices, built on the CMU Pronouncing Dictionary with mapping rules. We combine monolingual singing datasets with open-source singing voice conversion techniques to generate bilingual singing voices, and we also explore the potential use of bilingual speech data. Experiments show that our language-independent representation and the incorporation of related datasets enable a single model to improve English and code-switching SVS performance while maintaining performance on Chinese songs. Audio samples are available at https://bisinger-svs.github.io.
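To make the idea of a shared representation concrete, the following minimal sketch maps a code-switched lyric onto one phoneme inventory. It is an illustration under stated assumptions, not the mapping used in BiSinger: the English entries imitate CMU-dictionary-style ARPAbet symbols with stress digits dropped, while the pinyin-to-ARPAbet table, the names `TOY_CMU`, `TOY_PINYIN`, and `to_shared_phonemes`, and the example lyric are all hypothetical.

```python
# Illustrative only: these tables are NOT the paper's actual mapping rules.
# English entries imitate CMU-dictionary-style ARPAbet (stress digits dropped).
TOY_CMU = {
    "i": ["AY"],
    "love": ["L", "AH", "V"],
    "you": ["Y", "UW"],
}

# Hypothetical pinyin-syllable -> ARPAbet approximations (assumed for this sketch).
TOY_PINYIN = {
    "wo": ["W", "AO"],
    "ai": ["AY"],
    "ni": ["N", "IY"],
}

TABLES = {"en": TOY_CMU, "zh": TOY_PINYIN}


def to_shared_phonemes(tokens):
    """Map (language, token) pairs onto a single shared phoneme sequence."""
    phonemes = []
    for lang, token in tokens:
        if lang not in TABLES:
            raise ValueError(f"unsupported language tag: {lang!r}")
        table = TABLES[lang]
        key = token.lower()
        if key not in table:
            raise KeyError(f"no entry for {token!r} in the toy {lang} table")
        phonemes.extend(table[key])
    return phonemes


# A code-switched lyric: Mandarin "wo", English "love", Mandarin "ni".
lyric = [("zh", "wo"), ("en", "love"), ("zh", "ni")]
print(to_shared_phonemes(lyric))
```

Because both languages resolve to the same symbol set, a single acoustic model could in principle consume the output without a per-language phoneme vocabulary. How closely each Mandarin sound can be approximated by an English phoneme is exactly what real mapping rules would need to settle.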