Language Identification with a Reciprocal Rank Classifier

Language identification is a critical component of language processing pipelines (Jauhiainen et al., 2019) and is not a solved problem in real-world settings. We present a lightweight and effective language identifier that is robust to changes of domain and to the absence of copious training data. The key idea for classification is that the reciprocal of a feature's rank in a frequency table makes an effective additive feature score, hence the term Reciprocal Rank Classifier (RRC). The key finding for language classification is that ranked lists of words and frequencies of characters form a sufficient and robust representation of the regularities of key languages and their orthographies. We test this on two 22-language data sets and demonstrate zero-effort domain adaptation from a Wikipedia training set to a Twitter test set. When trained on Wikipedia but applied to Twitter, the macro-averaged F1-score of a conventionally trained SVM classifier drops from 90.9% to 77.7%. By contrast, the macro-averaged F1-score of the RRC drops only from 93.1% to 90.6%. These classifiers are compared with those from fastText and langid. The RRC performs better than these established systems in most experiments, especially on short Wikipedia texts and on Twitter. The RRC can be improved for particular domains and conversational situations by adding words to the ranked lists. Using new terms learned from such conversations, we demonstrate a further 7.9% increase in accuracy for sample message classification and a 1.7% increase for conversation classification. Surprisingly, this made results on Twitter data slightly worse. The RRC is available as an open source Python package (https://github.com/LivePersonInc/lplangid).
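
The reciprocal-rank scoring idea can be illustrated with a minimal Python sketch. This is not the lplangid API: the function names `build_ranks` and `classify`, the whitespace tokenization, and the toy training sentences are all assumptions made for illustration, and the sketch uses word ranks only, leaving out the character frequencies that the RRC also relies on.

```python
from collections import Counter


def build_ranks(texts):
    """Map each word to its 1-based rank in a frequency table (hypothetical helper)."""
    counts = Counter(word for text in texts for word in text.lower().split())
    # most_common() sorts by descending count; rank 1 is the most frequent word.
    return {word: rank for rank, (word, _) in enumerate(counts.most_common(), start=1)}


def classify(text, ranks_by_lang):
    """Score each language by summing 1/rank over the known words in the text."""
    words = text.lower().split()
    scores = {
        lang: sum(1.0 / ranks[w] for w in words if w in ranks)
        for lang, ranks in ranks_by_lang.items()
    }
    return max(scores, key=scores.get), scores


# Toy training data, invented for this example only.
toy_ranks = {
    "en": build_ranks(["the cat and the dog", "the house is on the hill"]),
    "es": build_ranks(["el gato y el perro", "la casa está en la colina"]),
}

best_lang, lang_scores = classify("the dog is here", toy_ranks)
print(best_lang, lang_scores)
```

Because the score is a plain sum over ranked-list lookups, adapting such a classifier to a new domain amounts to adding words to a language's ranked list, with no retraining step.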
