Judeo-Arabic refers to Arabic variants historically spoken by Jewish communities across the Arab world, primarily during the Middle Ages. Unlike standard Arabic, it is written in Hebrew script by Jewish writers and for Jewish audiences. Transliterating Judeo-Arabic into Arabic script is challenging due to ambiguous letter mappings, inconsistent orthographic conventions, and frequent code-switching into Hebrew. In this paper, we introduce a two-step approach to automatically transliterate Judeo-Arabic into Arabic script: simple character-level mapping followed by post-correction to address grammatical and orthographic errors. We also present the first benchmark evaluation of LLMs on this task. Finally, we show that transliteration enables Arabic NLP tools to perform morphosyntactic tagging and machine translation, which would not have been feasible on the original texts. We make our code and data publicly available.
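As a rough illustration of the first step only, the character-level mapping could look like the minimal Python sketch below. The table `HEBREW_TO_ARABIC` and the function `map_characters` are hypothetical names, and the one-to-one letter choices are a simplified assumption made for this example, not the mapping used in the paper. Several Hebrew letters can stand for more than one Arabic letter, which is the ambiguity the post-correction step is meant to resolve; that second step is not shown here.

```python
# Simplified one-to-one table from Hebrew letters to Arabic letters.
# ASSUMPTION: illustrative only, not the paper's mapping. Each Hebrew letter
# is given a single default target even where Judeo-Arabic usage is ambiguous.
HEBREW_TO_ARABIC = {
    "א": "ا",  # alef
    "ב": "ب",  # bet
    "ג": "ج",  # gimel
    "ד": "د",  # dalet
    "ה": "ه",  # he
    "ו": "و",  # vav
    "ז": "ز",  # zayin
    "ח": "ح",  # het
    "ט": "ط",  # tet
    "י": "ي",  # yod
    "כ": "ك",  # kaf
    "ך": "ك",  # final kaf
    "ל": "ل",  # lamed
    "מ": "م",  # mem
    "ם": "م",  # final mem
    "נ": "ن",  # nun
    "ן": "ن",  # final nun
    "ס": "س",  # samekh
    "ע": "ع",  # ayin
    "פ": "ف",  # pe
    "ף": "ف",  # final pe
    "צ": "ص",  # tsadi
    "ץ": "ص",  # final tsadi
    "ק": "ق",  # qof
    "ר": "ر",  # resh
    "ש": "ش",  # shin
    "ת": "ت",  # tav
}

# Precompute a translation table for str.translate.
_TABLE = str.maketrans(HEBREW_TO_ARABIC)


def map_characters(text: str) -> str:
    """Replace each Hebrew letter with its default Arabic counterpart.

    Characters outside the table (spaces, digits, punctuation, diacritics)
    are passed through unchanged.
    """
    return text.translate(_TABLE)


if __name__ == "__main__":
    # A Judeo-Arabic spelling of the Arabic word for "book".
    print(map_characters("כתאב"))
```

A context-free table like this cannot decide between competing readings of an ambiguous letter or detect Hebrew code-switched words, so its output would still need the post-correction step described above.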