The Perso-Arabic scripts are a family of scripts widely adopted by linguistic communities around the globe. Identifying the languages written in these scripts is crucial to language technologies and is challenging in low-resource setups. This paper sheds light on the challenges of detecting languages that use Perso-Arabic scripts, especially in bilingual communities where ``unconventional'' writing is practiced. To address these challenges, we use a set of supervised techniques to classify sentences by language. Building on these, we also propose a hierarchical model that targets clusters of languages that the classifiers confuse most often. Our experimental results indicate the effectiveness of our solutions.
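To make the hierarchical idea concrete, the following is a minimal sketch and not the paper's actual implementation. It assumes a character n-gram naive Bayes classifier as the supervised component and a hand-specified cluster of confusable languages; the class names (NgramNB, HierarchicalLangID), the toy sentences, and the choice of Persian and Urdu as a confusable cluster are all illustrative assumptions. A first-stage classifier predicts over all languages, and when its prediction falls inside a cluster, a second classifier trained only on that cluster makes the final decision.

```python
import math
from collections import Counter, defaultdict


def char_ngrams(text, n):
    """Return the character n-grams of text, padded with one space per side."""
    padded = f" {text} "
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


class NgramNB:
    """Multinomial naive Bayes over character n-grams (illustrative only)."""

    def __init__(self, n=2, alpha=1.0):
        self.n = n          # n-gram length
        self.alpha = alpha  # add-alpha smoothing constant

    def fit(self, sentences, labels):
        self.counts = defaultdict(Counter)
        self.doc_counts = Counter(labels)
        self.vocab = set()
        for sentence, label in zip(sentences, labels):
            grams = char_ngrams(sentence, self.n)
            self.counts[label].update(grams)
            self.vocab.update(grams)
        self.totals = {y: sum(c.values()) for y, c in self.counts.items()}
        return self

    def predict(self, sentence):
        grams = char_ngrams(sentence, self.n)
        vocab_size = len(self.vocab)
        n_docs = sum(self.doc_counts.values())

        def score(label):
            # Log prior plus smoothed log likelihood of every n-gram.
            counts, total = self.counts[label], self.totals[label]
            denom = total + self.alpha * vocab_size
            return math.log(self.doc_counts[label] / n_docs) + sum(
                math.log((counts[g] + self.alpha) / denom) for g in grams
            )

        # Sorting the labels first makes tie-breaking deterministic.
        return max(sorted(self.counts), key=score)


class HierarchicalLangID:
    """Two-stage classifier: a root model plus one expert per cluster."""

    def __init__(self, clusters):
        # Assumption: clusters of confusable languages are given up front,
        # e.g. read off the confusion matrix of a flat classifier.
        self.clusters = [set(c) for c in clusters]

    def fit(self, sentences, labels):
        self.root = NgramNB(n=2).fit(sentences, labels)
        self.experts = []
        for cluster in self.clusters:
            pairs = [(s, y) for s, y in zip(sentences, labels) if y in cluster]
            if not pairs:
                raise ValueError(f"no training data for cluster {cluster}")
            xs, ys = zip(*pairs)
            # The expert only sees languages from its own cluster.
            self.experts.append(NgramNB(n=3).fit(xs, ys))
        return self

    def predict(self, sentence):
        first = self.root.predict(sentence)
        for cluster, expert in zip(self.clusters, self.experts):
            if first in cluster:
                return expert.predict(sentence)
        return first


# Toy data for illustration only; far too small for a real experiment.
TOY_DATA = [
    ("این یک کتاب است", "fa"),
    ("من به مدرسه می‌روم", "fa"),
    ("یہ ایک کتاب ہے", "ur"),
    ("میں اسکول جاتا ہوں", "ur"),
    ("هذا كتاب", "ar"),
    ("أنا أذهب إلى المدرسة", "ar"),
]
sentences, labels = zip(*TOY_DATA)
model = HierarchicalLangID(clusters=[{"fa", "ur"}]).fit(sentences, labels)
print(model.predict("یہ ایک کتاب ہے"))
```

In this sketch the expert uses longer n-grams than the root, on the assumption that closely related languages need finer-grained features to be separated; in practice the clusters and the features would be chosen from the observed confusions.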