Optical Character Recognition and Transcription of Berber Signs from Images in a Low-Resource Language Amazigh

Berber, or Amazigh, is a family of low-resource Afroasiatic languages spoken across North Africa by the indigenous Berber people. It has its own alphabet, Tifinagh, used by Berber communities in Morocco, Algeria, and other countries. Although Berber is spoken by 14 million people, it lacks adequate representation in education, research, and web applications. For instance, Google Translate, which hosts over 100 languages today, offers no translation to or from Amazigh / Berber. Consequently, specialized educational apps, tools for L2 (second-language) acquisition, automated translation, and remote-access facilities are not available in Berber. Motivated by this background, we propose a supervised approach called DaToBS, for Detection and Transcription of Berber Signs. DaToBS automatically recognizes and transcribes Tifinagh characters from signs in photographs of natural environments. To achieve this, we create our own corpus of 1862 pre-processed character images, curate it with human-guided annotation, and feed it into an OCR model built on a convolutional neural network (CNN), a deep learning architecture from computer vision. We use computer vision models rather than language models because this alphabet contains pictorial symbols; this choice is a novel aspect of our work. In our experiments and analyses, DaToBS yields over 92 percent accuracy. To the best of our knowledge, ours is among the first few works on automated transcription of Berber signs from roadside images with deep learning that yield high accuracy. This can pave the way for developing pedagogical applications in the Berber language, thereby addressing an important goal of outreach to underrepresented communities via AI in education.
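
To make the CNN step concrete, the following is a minimal sketch of what one convolutional stage computes on a character image: a convolution, a ReLU non-linearity, and 2x2 max pooling. It is not the DaToBS architecture, which the abstract does not specify. The function names, the 6x6 toy glyph, and the hand-written edge kernel are all assumptions made for illustration; in a trained CNN the kernel weights would be learned from the annotated corpus.

```python
from typing import List

Matrix = List[List[float]]


def conv2d(image: Matrix, kernel: Matrix) -> Matrix:
    """Valid (no padding, stride 1) cross-correlation, as used in CNN layers."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(
                image[i + di][j + dj] * kernel[di][dj]
                for di in range(kh)
                for dj in range(kw)
            )
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]


def relu(feature_map: Matrix) -> Matrix:
    """Zero out negative responses."""
    return [[max(0.0, v) for v in row] for row in feature_map]


def max_pool2x2(feature_map: Matrix) -> Matrix:
    """Keep the strongest response in each non-overlapping 2x2 window."""
    return [
        [
            max(
                feature_map[i][j],
                feature_map[i][j + 1],
                feature_map[i + 1][j],
                feature_map[i + 1][j + 1],
            )
            for j in range(0, len(feature_map[0]) - 1, 2)
        ]
        for i in range(0, len(feature_map) - 1, 2)
    ]


# Hypothetical 6x6 binarized glyph: a vertical stroke two pixels wide.
toy_glyph: Matrix = [[0.0, 0.0, 1.0, 1.0, 0.0, 0.0] for _ in range(6)]

# Hypothetical hand-written kernel that responds to dark-to-bright
# transitions from left to right (a learned kernel would replace it).
edge_kernel: Matrix = [[-1.0, 0.0, 1.0] for _ in range(3)]

features = max_pool2x2(relu(conv2d(toy_glyph, edge_kernel)))
print(features)
```

On this toy input the pooled 2x2 feature map is non-zero only in its left column, which is where the kernel overlaps the left edge of the stroke. A full classifier would stack several such stages with many learned kernels and end in a layer that scores each Tifinagh character class.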
