We tackle the task of text-to-speech (TTS) in Hebrew. Traditional Hebrew text contains diacritics, which dictate how a given word should be pronounced; modern Hebrew, however, rarely uses them. Without diacritics, readers are expected to infer the correct pronunciation, and hence which phonemes to use, from context. This poses a fundamental challenge for TTS systems, which must accurately map text to speech. In this work, we propose a diacritics-free language modeling approach to Hebrew TTS. The model operates on discrete speech representations and is conditioned on word-piece tokens of the input text. We optimize the proposed method using in-the-wild, weakly supervised data and compare it to several diacritic-based TTS systems. Results suggest that the proposed method is superior to the evaluated baselines in both content preservation and naturalness of the generated speech. Samples can be found at the following link: pages.cs.huji.ac.il/adiyoss-lab/HebTTS/