This paper introduces VoxHakka, a text-to-speech (TTS) system for Taiwanese Hakka, a critically under-resourced language spoken in Taiwan. Built on the YourTTS framework, VoxHakka synthesizes speech with high naturalness and accuracy at a low real-time factor, and supports six distinct Hakka dialects. The model is trained on dialect-specific data, which enables it to generate speaker-aware Hakka speech. To address the scarcity of publicly available Hakka speech corpora, we used a cost-effective web scraping pipeline combined with automatic speech recognition (ASR)-based data cleaning, which yielded a high-quality, multi-speaker, multi-dialect dataset suitable for TTS training. In subjective listening tests using comparative mean opinion scores (CMOS), VoxHakka significantly outperforms existing publicly available Hakka TTS systems in pronunciation accuracy, tone correctness, and overall naturalness. This work advances Hakka language technology and provides a valuable resource for language preservation and revitalization efforts.
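The abstract does not specify how the ASR-based cleaning step works. One common way to do this kind of filtering, sketched here purely as an illustration, is to transcribe each scraped clip with an ASR model, compare the result with the scraped text, and keep only clips whose character error rate (CER) falls below a threshold. The names `Utterance` and `filter_utterances` and the threshold `max_cer=0.1` are hypothetical and are not taken from the paper. The ASR transcripts are assumed to have been computed already.

```python
from dataclasses import dataclass


def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance between two character sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(ref: str, hyp: str) -> float:
    """Character error rate of hyp against ref (edit distance / len(ref))."""
    if not ref:
        return 0.0 if not hyp else 1.0
    return edit_distance(ref, hyp) / len(ref)


@dataclass
class Utterance:
    audio_path: str    # location of the scraped clip (hypothetical field)
    scraped_text: str  # text that accompanied the clip on the source page
    asr_text: str      # transcript from an ASR model, assumed precomputed


def filter_utterances(utts, max_cer=0.1):
    """Keep utterances whose ASR transcript closely matches the scraped text.

    max_cer is an assumed threshold, not a value reported in the paper.
    """
    return [u for u in utts if cer(u.scraped_text, u.asr_text) <= max_cer]
```

A stricter threshold discards more mismatched audio-text pairs but keeps less data, so in practice the value would be tuned against the amount of usable speech remaining per dialect.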