Text-based automatic diacritic restoration models generally have high diacritic error rates when applied to speech transcripts, owing to domain and style shifts in spoken language. In this work, we explore whether the performance of automatic diacritic restoration on speech data can be improved by utilizing the parallel spoken utterances. In particular, we fine-tune the pre-trained Whisper ASR model on relatively small amounts of diacritized Arabic speech data to produce rough diacritized transcripts of the speech utterances, which we then use as an additional input to diacritic restoration models. The proposed framework consistently improves diacritic restoration performance compared to text-only baselines. Our results highlight the inadequacy of current text-based diacritic restoration models for speech datasets and provide a new baseline for speech-based diacritic restoration.
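Diacritic error rate (DER) is the metric behind the comparison with text-only baselines. The abstract does not specify its exact definition, so the following is only a minimal sketch of one common variant, under stated assumptions. It assumes that Arabic diacritics are the combining marks U+064B to U+0652 (tanween, short vowels, shadda, sukun). It also assumes that reference and hypothesis share identical undiacritized text and that only letters are scored. The function names are hypothetical and are not taken from the paper.

```python
# Hypothetical DER sketch; one common definition, not necessarily the paper's.
# Assumption: Arabic diacritics = U+064B..U+0652 (tanween, vowels, shadda, sukun).
ARABIC_DIACRITICS = {chr(c) for c in range(0x064B, 0x0653)}


def split_diacritics(text: str) -> list[tuple[str, str]]:
    """Split text into (base_character, attached_diacritics) pairs."""
    pairs: list[tuple[str, str]] = []
    for ch in text:
        if ch in ARABIC_DIACRITICS:
            if not pairs:
                raise ValueError("text starts with a diacritic")
            base, marks = pairs[-1]
            pairs[-1] = (base, marks + ch)  # attach mark to the preceding base char
        else:
            pairs.append((ch, ""))
    return pairs


def diacritic_error_rate(reference: str, hypothesis: str) -> float:
    """Fraction of letters whose diacritic set differs from the reference."""
    ref = split_diacritics(reference)
    hyp = split_diacritics(hypothesis)
    # Assumption: both strings have the same undiacritized base text.
    if [b for b, _ in ref] != [b for b, _ in hyp]:
        raise ValueError("base letters differ; this DER assumes identical undiacritized text")
    # Score letters only; spaces and punctuation are skipped.
    scored = [(r, h) for (b, r), (_, h) in zip(ref, hyp) if b.isalpha()]
    if not scored:
        return 0.0
    # Compare marks order-insensitively (e.g. shadda+fatha vs fatha+shadda).
    errors = sum(1 for r, h in scored if sorted(r) != sorted(h))
    return errors / len(scored)
```

For example, scoring the hypothesis كَتِبَ against the reference كَتَبَ gives 1/3, because one of three letters carries the wrong vowel. Other DER variants also count letters without diacritics differently or exclude word-final case endings. Results are only comparable under the same definition.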