Automatic Arabic diacritization is useful in many applications, ranging from reading support for language learners to accurate pronunciation prediction for downstream tasks such as speech synthesis. While most previous work has focused on models that operate on raw non-diacritized text, production systems can gain accuracy by first letting humans partially annotate ambiguous words. In this paper, we propose 2SDiac, a multi-source model that can effectively support optional diacritics in the input to inform all predictions. We also introduce Guided Learning, a training scheme that leverages diacritics given in the input under different levels of random masking. We show that hints provided at test time affect more output positions than just those annotated. Moreover, experiments on two common benchmarks show that our approach i) greatly outperforms the baseline even when evaluated on non-diacritized text; and ii) achieves state-of-the-art results while reducing the parameter count by over 60%.
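To make the idea of training with randomly masked diacritic hints concrete, the following is a minimal sketch and not the 2SDiac implementation. It assumes that hints are masked independently per character and that the masking level is drawn per sentence from a fixed set of rates; both choices, along with the helper names `split_diacritics`, `mask_hints`, and `make_training_example`, are hypothetical illustrations.

```python
import random

# Arabic combining diacritics from fathatan to sukun (U+064B..U+0652).
DIACRITICS = {chr(c) for c in range(0x064B, 0x0653)}


def split_diacritics(text):
    """Return (base, marks), where marks[i] holds the diacritics following base[i]."""
    base, marks = [], []
    for ch in text:
        if ch in DIACRITICS and base:
            marks[-1] += ch  # attach the mark to the preceding base character
        else:
            base.append(ch)
            marks.append("")
    return base, marks


def mask_hints(marks, mask_rate, rng):
    """Drop each position's diacritics independently with probability mask_rate."""
    return [m if m and rng.random() >= mask_rate else "" for m in marks]


def make_training_example(diacritized, rng):
    """Build (plain text, partial hints, full target) from a diacritized sentence."""
    base, target = split_diacritics(diacritized)
    # Assumption: one masking level per sentence, sampled from hypothetical rates.
    mask_rate = rng.choice([0.0, 0.25, 0.5, 0.75, 1.0])
    hints = mask_hints(target, mask_rate, rng)
    return "".join(base), hints, target
```

With a mask rate of 1.0 the model would see only raw text, and with 0.0 it would see every diacritic, so sampling rates in between exposes training to the full range from unannotated to partially annotated input.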