High-quality parallel corpora are essential for Machine Translation (MT) research and translation teaching. However, Arabic-English resources remain scarce, and existing datasets consist mainly of simple one-to-one mappings. In this paper, we present AlignAR, a generative sentence alignment method, and a new Arabic-English dataset comprising simple legal and complex literary parallel texts. Our evaluation shows that "Easy" datasets lack the discriminatory power to fully assess alignment methods. By reducing the share of one-to-one mappings in our "Hard" subset, we expose the limitations of traditional alignment methods. In contrast, LLM-based approaches prove more robust, achieving an overall F1-score of 85.5%, a nearly 9% improvement over previous methods. Our datasets and code are open-sourced at https://github.com/XXX.