This paper develops an approach to language identification in which the set of languages considered by the model depends on the geographic origin of the text in question. Given that many digital corpora can be geo-referenced at the country level, this paper formulates 16 region-specific models, each of which contains the languages expected to appear in countries within that region. Each regional model also includes 31 widely spoken international languages, ensuring coverage of these lingua francas regardless of location. An upstream evaluation using traditional language identification testing data shows an improvement in F-score ranging from 1.7 points (Southeast Asia) to as much as 10.4 points (North Africa). A downstream evaluation on social media data shows that this improved performance has a significant impact on the language labels that are applied to large real-world corpora. The result is a highly accurate model that covers 916 languages at a sample size of 50 characters, with performance improved by incorporating geographic information into the model.
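The core idea, restricting the candidate languages by the country a text comes from, can be illustrated with a minimal sketch. This is not the paper's implementation: the paper builds separate region-specific models, whereas the sketch below simply filters the scores of an arbitrary classifier. The country-to-region table, the language inventories, and the function names `candidate_languages` and `pick_language` are all hypothetical and heavily abbreviated for illustration.

```python
from typing import Dict, Set

# Hypothetical, abbreviated lookup tables. The paper uses 16 regions and
# 31 international languages; these few entries are for illustration only.
COUNTRY_TO_REGION: Dict[str, str] = {
    "MA": "north_africa",
    "DZ": "north_africa",
    "VN": "southeast_asia",
    "TH": "southeast_asia",
}

REGION_LANGUAGES: Dict[str, Set[str]] = {
    "north_africa": {"ary", "arq", "kab"},
    "southeast_asia": {"vie", "tha", "khm"},
}

INTERNATIONAL_LANGUAGES: Set[str] = {"eng", "fra", "spa", "ara"}


def candidate_languages(country_code: str) -> Set[str]:
    """Return the regional languages plus the international languages.

    Unknown countries fall back to the international languages alone.
    """
    region = COUNTRY_TO_REGION.get(country_code.upper())
    regional = REGION_LANGUAGES.get(region, set())
    return regional | INTERNATIONAL_LANGUAGES


def pick_language(scores: Dict[str, float], country_code: str) -> str:
    """Pick the best-scoring language among those allowed for the country.

    `scores` maps language codes to scores from any classifier
    (an assumption of this sketch, not the paper's setup).
    """
    allowed = candidate_languages(country_code)
    restricted = {lang: s for lang, s in scores.items() if lang in allowed}
    if not restricted:
        raise ValueError("no candidate language received a score")
    return max(restricted, key=restricted.get)
```

With these toy tables, the same score dictionary can yield different labels depending on the country of origin, which is the effect the geographic restriction is meant to have: implausible languages are excluded before the final decision is made.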