Knowing the language of an input text or audio is a necessary first step for using almost every natural language processing (NLP) tool, such as taggers, parsers, or translation systems. Language identification is a well-studied problem, sometimes even considered solved; in reality, most of the world's 7,000 languages are not supported by current systems. This lack of representation affects large-scale data mining efforts and further exacerbates the data shortage for low-resource languages. We take a step towards tackling the data bottleneck by compiling a corpus of over 50K parallel children's stories in 350+ languages and dialects, and the computation bottleneck by building lightweight hierarchical models for language identification. Our data can serve as benchmark data for language identification of short texts and for understudied translation directions such as those between Indian or African languages. Our proposed method, Hierarchical LIMIT, uses limited computation to expand coverage to languages that current systems exclude while maintaining prediction quality.