Of the more than 7,000 languages spoken in the world, commercial language identification (LID) systems reliably identify only a few hundred in written form. Research-grade systems extend this coverage under certain circumstances, but for most languages coverage remains patchy or nonexistent. This position paper argues that this situation is largely self-imposed. In particular, it arises from a persistent framing of LID as decontextualized text classification, a framing that obscures the central role of prior probability estimation and is reinforced by institutional incentives that favor global, fixed-prior models. We argue that improving coverage for tail languages requires rethinking LID as a routing problem and developing principled ways to incorporate environmental cues that make languages locally plausible.