Grapheme-to-Phoneme (G2P) conversion is an essential first step in any modern, high-quality Text-to-Speech (TTS) system. Most current G2P systems rely on carefully hand-crafted lexicons developed by experts. This poses a two-fold problem. First, the lexicons are built on a fixed phoneme set, usually ARPABET or IPA, which may not be the optimal way to represent phonemes for every language. Second, producing such an expert lexicon requires a very large number of man-hours. In this paper, we address both issues by using recent advances in self-supervised learning to obtain data-driven phoneme representations in place of fixed ones. We compare our lexicon-free approach against strong baselines that use a well-crafted lexicon. We show that our data-driven, lexicon-free method performs as well as, or even marginally better than, conventional rule-based or lexicon-based neural G2P systems in terms of Mean Opinion Score (MOS), while using no prior language lexicon or phoneme set, i.e., no linguistic expertise.