Text-based diacritic restoration models generally exhibit high diacritic error rates when applied to speech transcripts, because spoken language differs from written text in domain and style. In this work, we explore whether diacritic restoration on speech data can be improved by using the spoken utterances that accompany the transcripts. In particular, we fine-tune the pre-trained Whisper ASR model on a relatively small amount of diacritized Arabic speech data to produce rough diacritized transcripts of the utterances, which we then use as an additional input to a transformer-based diacritic restoration model. The proposed model consistently improves diacritic restoration performance compared to an equivalent text-only model, with at least a 5% absolute reduction in diacritic error rate within the same domain and on two out-of-domain test sets. Our results underscore the inadequacy of current text-based diacritic restoration models for speech data sets and provide a new baseline for speech-based diacritic restoration.
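The abstract reports its results as diacritic error rate (DER) but does not define how it is computed. As an illustration only, the sketch below assumes one common convention: DER is the fraction of letters whose attached diacritics differ between a reference and a hypothesis that share the same undiacritized text. The function names, the set of marks treated as diacritics, and the choice to score only letters are assumptions made for this example and may differ from the authors' evaluation.

```python
import unicodedata

# Assumption: diacritics are the Arabic marks U+064B..U+0652
# (tanween, fatha, damma, kasra, shadda, sukun) plus dagger alif U+0670.
ARABIC_DIACRITICS = {chr(c) for c in range(0x064B, 0x0653)} | {"\u0670"}


def split_diacritics(text):
    """Split text into (base_character, attached_diacritics) pairs."""
    pairs = []
    for ch in text:
        if ch in ARABIC_DIACRITICS:
            if pairs:  # a diacritic with no preceding base character is ignored
                base, marks = pairs[-1]
                pairs[-1] = (base, marks + ch)
        else:
            pairs.append((ch, ""))
    return pairs


def diacritic_error_rate(reference, hypothesis):
    """Fraction of letters whose diacritics differ (hypothetical DER definition)."""
    ref = split_diacritics(reference)
    hyp = split_diacritics(hypothesis)
    if [base for base, _ in ref] != [base for base, _ in hyp]:
        raise ValueError("reference and hypothesis differ in their base characters")
    # Score letters only; spaces, digits, and punctuation carry no diacritics.
    scored = [
        (ref_marks, hyp_marks)
        for (base, ref_marks), (_, hyp_marks) in zip(ref, hyp)
        if unicodedata.category(base).startswith("L")
    ]
    if not scored:
        return 0.0
    # Sort the marks so that shadda+vowel and vowel+shadda compare as equal.
    errors = sum(sorted(r) != sorted(h) for r, h in scored)
    return errors / len(scored)


# "kataba" (reference) versus "kutiba" (hypothesis): same letters, different vowels.
reference = "\u0643\u064E\u062A\u064E\u0628\u064E"
hypothesis = "\u0643\u064F\u062A\u0650\u0628\u064E"
der = diacritic_error_rate(reference, hypothesis)
```

In this example two of the three letters carry a different vowel mark, so `der` equals 2/3. Under this definition, an absolute reduction of 5% means that the share of wrongly diacritized letters falls by five percentage points.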