Cascaded speech translation (ST) systems suffer from error propagation when Automatic Speech Recognition (ASR) outputs incorrect transcripts. We present the first systematic categorization of ASR errors for Vietnamese ST, classifying substitution errors by phonetic cause and quantifying their impact on downstream Neural Machine Translation (NMT) performance using Linear Mixed-Effects Modelling. We confirm that most ASR substitution errors arise from phonetic confusions rather than random noise, and that these phonetic errors significantly degrade ST quality. Motivated by this finding, we propose Phonetically-Informed Data Augmentation (PiDA), which generates ASR-like corruptions by substituting words with phonetically similar alternatives using phonetic word embeddings. Fine-tuning on a PiDA-augmented version of FLEURS Vietnamese-English improves translation of erroneous ASR outputs (up to +2.04 BLEU over standard fine-tuning) while also slightly improving clean-text performance.
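As a rough illustration of the substitution step described above, the following minimal sketch replaces words with their nearest neighbour in a phonetic embedding space. It is not the authors' implementation. The toy vocabulary, the hand-picked vectors in `PHONETIC_VECS`, the helper names (`cosine`, `nearest_phonetic_neighbor`, `augment`), and the `sub_prob` parameter are all hypothetical and serve only to make the idea concrete.

```python
import math
import random

# Hypothetical toy "phonetic embeddings": hand-picked vectors for illustration,
# not taken from the paper. Similar-sounding syllables get nearby vectors.
PHONETIC_VECS = {
    "ban": (1.0, 0.1, 0.0),
    "bang": (0.9, 0.2, 0.1),
    "tan": (0.2, 1.0, 0.0),
    "tang": (0.1, 0.9, 0.2),
    "mua": (0.0, 0.1, 1.0),
}


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def nearest_phonetic_neighbor(word, vecs):
    """Return the most phonetically similar other word, or None if unknown."""
    if word not in vecs:
        return None
    candidates = [(cosine(vecs[word], v), w) for w, v in vecs.items() if w != word]
    if not candidates:
        return None
    return max(candidates)[1]


def augment(tokens, vecs, sub_prob, rng):
    """Replace each known token with its nearest neighbor with probability sub_prob."""
    out = []
    for tok in tokens:
        neighbor = nearest_phonetic_neighbor(tok, vecs)
        if neighbor is not None and rng.random() < sub_prob:
            out.append(neighbor)  # simulated ASR-like phonetic confusion
        else:
            out.append(tok)  # keep the original token
    return out


rng = random.Random(0)
corrupted = augment(["ban", "tan", "xyz"], PHONETIC_VECS, sub_prob=1.0, rng=rng)
print(corrupted)
```

In an augmentation setup of this kind, each corrupted source sentence would presumably be paired with the original, unmodified target translation, so that the NMT model learns to recover the intended meaning from ASR-like input. The substitution rate and the choice of neighbour (nearest versus sampled from the top few) are design choices that this sketch does not attempt to settle.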