One of the key issues in Mandarin Chinese text-to-speech (TTS) systems is polyphone disambiguation during grapheme-to-phoneme (G2P) conversion. In this paper, we introduce a novel method that treats the problem as a generation task. Following recent research on large language models (LLMs) and prompt learning, the proposed method consists of three modules. The retrieval module incorporates external knowledge, namely a multi-level semantic dictionary of Chinese polyphonic characters, to format the sentence into a prompt. The generation module adopts a decoder-only Transformer architecture to generate the target text. The postprocessing module corrects the generated text into a valid result if needed. Experimental results show that our method outperforms existing methods on the public CPP dataset. We also empirically study the impact of different prompt templates, different training data sizes, and whether external knowledge is incorporated.
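
The three-module pipeline can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not taken from the paper: the toy dictionary `POLYPHONE_DICT`, the prompt template, the tone-number pinyin notation, the stub `fake_generate` that stands in for the decoder-only Transformer, and the string-similarity fallback used in postprocessing.

```python
import difflib
from typing import Callable

# Hypothetical toy dictionary: polyphonic character -> {pinyin: short gloss}.
# The paper's multi-level semantic dictionary is richer; this is a placeholder.
POLYPHONE_DICT = {
    "行": {"xing2": "to walk; to be OK", "hang2": "row; line of business"},
    "乐": {"le4": "happy", "yue4": "music"},
}


def build_prompt(sentence: str, index: int) -> str:
    """Retrieval step: look up the target character and format a prompt.

    The template is an assumption, not the one used in the paper.
    """
    char = sentence[index]
    entries = POLYPHONE_DICT[char]
    knowledge = "; ".join(f"{pinyin} ({gloss})" for pinyin, gloss in entries.items())
    return (
        f"Sentence: {sentence}\n"
        f"Character: {char} at position {index}\n"
        f"Candidates: {knowledge}\n"
        f"Pronunciation:"
    )


def fake_generate(prompt: str) -> str:
    """Generation step: stand-in for a decoder-only language model.

    Returns a deliberately untidy string to exercise postprocessing.
    """
    return " hang 2\n"


def postprocess(raw: str, char: str) -> str:
    """Postprocessing step: map the generated text onto a valid candidate."""
    candidates = list(POLYPHONE_DICT[char])
    cleaned = "".join(raw.split()).lower()
    if cleaned in candidates:
        return cleaned
    # Assumed fallback: pick the most similar valid pronunciation.
    return difflib.get_close_matches(cleaned, candidates, n=1, cutoff=0.0)[0]


def disambiguate(
    sentence: str, index: int, generate: Callable[[str], str] = fake_generate
) -> str:
    """Run retrieval, generation, and postprocessing in sequence."""
    prompt = build_prompt(sentence, index)
    return postprocess(generate(prompt), sentence[index])


if __name__ == "__main__":
    print(disambiguate("银行在哪里", 1))
```

In this sketch, swapping `fake_generate` for a call to a real language model is the only change needed to move beyond the stub; the postprocessing step guarantees that the returned string is always one of the dictionary's candidate pronunciations.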