Phonemization is a critical component of text-to-speech synthesis. Traditional approaches rely on deterministic transformations and lexica, while neural methods promise better generalization to out-of-vocabulary (OOV) terms. We introduce OLaPh (Optimal Language Phonemizer), a hybrid framework that integrates extensive multilingual lexica with advanced NLP techniques and a statistical subword segmentation function. Evaluations on the WikiPron benchmark show that OLaPh significantly outperforms established baselines in overall accuracy and remains robust on OOV data through advanced fallback mechanisms. To further explore neural generalization, we use the framework to synthesize a high-consistency training corpus for an instruction-tuned Large Language Model (LLM). While the deterministic framework remains more accurate overall, the LLM demonstrates strong generalization, matching or partly exceeding the framework's performance. This suggests that the LLM has internalized phonetic intuitions from the synthetic data that go beyond the framework's capabilities. Together, these tools provide a comprehensive, open-source resource for multilingual grapheme-to-phoneme (G2P) conversion research.
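To make the idea of a lexicon lookup with a subword-segmentation fallback concrete, the following is a minimal sketch and not OLaPh's actual implementation. It assumes a toy lexicon and made-up unigram log-probabilities; the names `LEXICON`, `SUBWORD_LOGP`, `segment`, and `phonemize` are hypothetical and do not come from the framework. An OOV word is split into the highest-scoring sequence of known lexicon entries, and their pronunciations are concatenated.

```python
import math

# Hypothetical toy lexicon (word -> IPA); not taken from OLaPh.
LEXICON = {
    "light": "laɪt",
    "house": "haʊs",
    "boat": "boʊt",
}

# Hypothetical unigram log-probabilities for the known subwords.
SUBWORD_LOGP = {
    "light": math.log(0.4),
    "house": math.log(0.4),
    "boat": math.log(0.2),
}


def segment(word):
    """Return the best-scoring split of `word` into known subwords, or None."""
    n = len(word)
    # best[i] = (score of best split of word[:i], start index of its last piece)
    best = [(-math.inf, None)] * (n + 1)
    best[0] = (0.0, None)
    for end in range(1, n + 1):
        for start in range(end):
            piece = word[start:end]
            if piece in SUBWORD_LOGP and best[start][0] > -math.inf:
                score = best[start][0] + SUBWORD_LOGP[piece]
                if score > best[end][0]:
                    best[end] = (score, start)
    if best[n][0] == -math.inf:
        return None  # no full cover of the word by known subwords
    # Walk the back-pointers to recover the pieces.
    pieces, end = [], n
    while end > 0:
        start = best[end][1]
        pieces.append(word[start:end])
        end = start
    return pieces[::-1]


def phonemize(word):
    """Lexicon lookup first; fall back to subword segmentation for OOV words."""
    word = word.lower()
    if word in LEXICON:
        return LEXICON[word]
    pieces = segment(word)
    if pieces is None:
        return None  # a real system would need a further fallback here
    return "".join(LEXICON[p] for p in pieces)
```

In this sketch, `phonemize("lighthouse")` misses the lexicon, is segmented into `light` and `house`, and returns the concatenation of the two stored pronunciations. A word that no combination of known subwords covers returns `None`, which is the point where a rule-based or neural fallback would take over.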