We present MaskLID, a simple yet effective code-switching (CS) language identification (LID) method. MaskLID does not require any training and is designed to complement current high-performance sentence-level LIDs. Sentence-level LIDs are classifiers trained on monolingual texts to provide a single label, typically using a softmax layer to turn scores into probabilities. However, when a sentence mixes two languages, L1 and L2, the LID classifier often returns only the dominant label, L1. To address this limitation, MaskLID masks the text features associated with L1, allowing the LID to classify the text as L2 in the next round. The method uses the LID itself to identify the features that require masking and does not rely on any external resource. In this work, we explore the use of MaskLID for two open-source LIDs, GlotLID and OpenLID, both of which are based on the FastText architecture. Code and demo are available at https://github.com/cisnlp/MaskLID.
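The mask-and-reclassify loop described in the abstract can be illustrated with a minimal sketch. This is not the released MaskLID implementation: the toy classifier, its per-word weights, the language codes, and the names `TOY_WEIGHTS`, `word_scores`, `predict`, and `mask_and_reclassify` are all assumptions made for illustration, and a real setup would query a FastText-based LID such as GlotLID or OpenLID instead.

```python
import math
from typing import Dict, List

LANGS = ["eng", "spa"]

# Made-up per-word scores standing in for the feature weights of a real LID.
TOY_WEIGHTS: Dict[str, Dict[str, float]] = {
    "the": {"eng": 2.0, "spa": 0.0},
    "weather": {"eng": 2.0, "spa": 0.0},
    "is": {"eng": 1.5, "spa": 0.0},
    "nice": {"eng": 1.5, "spa": 0.2},
    "pero": {"eng": 0.0, "spa": 2.0},
    "hace": {"eng": 0.0, "spa": 2.0},
    "frio": {"eng": 0.0, "spa": 2.0},
}


def word_scores(word: str) -> Dict[str, float]:
    """Return the toy per-language scores for one word (zeros if unknown)."""
    return TOY_WEIGHTS.get(word, {lang: 0.0 for lang in LANGS})


def predict(words: List[str]) -> Dict[str, float]:
    """Sentence-level toy LID: sum word scores, then apply a softmax."""
    totals = {lang: sum(word_scores(w)[lang] for w in words) for lang in LANGS}
    peak = max(totals.values())
    exps = {lang: math.exp(score - peak) for lang, score in totals.items()}
    norm = sum(exps.values())
    return {lang: value / norm for lang, value in exps.items()}


def mask_and_reclassify(text: str, max_rounds: int = 2) -> List[str]:
    """Collect labels by repeatedly masking the words tied to the top label."""
    words = text.lower().split()
    found: List[str] = []
    for _ in range(max_rounds):
        if not words:
            break  # everything was masked, so there is nothing left to classify
        probs = predict(words)
        top = max(probs, key=probs.get)
        if top in found:
            break  # no new language surfaced in this round
        found.append(top)
        # Mask (drop) every word whose strongest score belongs to the top label.
        words = [
            w for w in words
            if max(word_scores(w), key=word_scores(w).get) != top
        ]
    return found


if __name__ == "__main__":
    print(mask_and_reclassify("the weather is nice pero hace frio"))
```

In this toy setting the classifier itself supplies the signal used for masking, mirroring the idea that no external resource is needed. The rule for deciding which words belong to the dominant label is a simplification chosen for the sketch, not the criterion used in the paper.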