Low-resource automatic speech recognition (ASR) remains challenging, primarily because transcribed data are scarce for many languages. Taiwanese Hokkien exemplifies the problem: spoken content is abundant in television dramas and online videos, yet transcriptions are scarce and most available subtitles are in Mandarin only. To address this gap, we introduce TG-ASR, a translation-guided ASR framework for Taiwanese Hokkien drama speech recognition that uses multilingual translation embeddings to improve recognition in low-resource settings. At its core is the parallel gated cross-attention (PGCA) mechanism, which adaptively integrates embeddings from multiple auxiliary languages into the ASR decoder, providing robust cross-lingual semantic guidance while keeping optimization stable and minimizing interference between languages. To support further research, we present YT-THDC, a 30-hour corpus of Taiwanese Hokkien drama speech with aligned Mandarin subtitles and manually verified Taiwanese Hokkien transcriptions. Comprehensive experiments and analyses identify the auxiliary languages that most effectively improve ASR performance, achieving a 14.77% relative reduction in character error rate (CER) and demonstrating the effectiveness of translation-guided learning for underrepresented languages in real-world scenarios.
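For readers unfamiliar with the reported metric, the following is a minimal sketch of how a character error rate and a relative reduction between two systems are conventionally computed. It is not the evaluation code used in this work. The function names (`edit_distance`, `cer`, `relative_reduction`) and the toy strings are hypothetical and chosen only for illustration; no numbers below come from the experiments.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Character-level Levenshtein distance between reference and hypothesis."""
    prev = list(range(len(hyp) + 1))  # distances from the empty reference prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]  # distance from ref[:i] to the empty hypothesis
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference character
                curr[j - 1] + 1,     # insertion of a hypothesis character
                prev[j - 1] + cost,  # substitution, or match when cost is 0
            ))
        prev = curr
    return prev[-1]


def cer(ref: str, hyp: str) -> float:
    """Character error rate: edit distance divided by reference length."""
    if not ref:
        raise ValueError("reference must be non-empty")
    return edit_distance(ref, hyp) / len(ref)


def relative_reduction(baseline: float, improved: float) -> float:
    """Relative error reduction of `improved` with respect to `baseline`."""
    if baseline <= 0:
        raise ValueError("baseline error rate must be positive")
    return (baseline - improved) / baseline


# Toy, made-up strings (assumption: purely illustrative, not corpus data).
reference = "abcdefghij"
baseline_hyp = "abxdexghij"  # two substituted characters
improved_hyp = "abxdefghij"  # one substituted character

baseline_cer = cer(reference, baseline_hyp)
improved_cer = cer(reference, improved_hyp)
reduction = relative_reduction(baseline_cer, improved_cer)
```

Under this convention, a 14.77% relative reduction means the improved system's CER equals the baseline CER multiplied by (1 − 0.1477), regardless of the absolute values involved.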