Vietnamese has a phonetic orthography in which each grapheme corresponds to at most one phoneme, and vice versa. Exploiting this high grapheme-phoneme transparency, we propose ViSpeechFormer (\textbf{Vi}etnamese \textbf{Speech} Trans\textbf{Former}), a phoneme-based approach to Vietnamese automatic speech recognition (ASR). To the best of our knowledge, this is the first Vietnamese ASR framework that explicitly models phonemic representations. Experiments on two publicly available Vietnamese ASR datasets show that ViSpeechFormer achieves strong performance, generalizes better to out-of-vocabulary words, and is less affected by training bias. This phoneme-based paradigm is also promising for other languages with phonetic orthographies. The code will be released upon acceptance of this paper.