Code-switching (CS) is very common in written and spoken communication, yet many natural language processing applications handle it poorly. Motivated by the need to build CS corpora, we explore CS language identification (LID) for this purpose. We make the task more realistic by scaling it to more languages and by considering models with simpler architectures for faster inference. We also reformulate the task as a sentence-level multi-label tagging problem to make it more tractable. Having defined the task, we investigate three reasonable models and define metrics that better reflect the desired performance. We present empirical evidence that no current approach is adequate, and we conclude with recommendations for future work in this area.
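To make the sentence-level multi-label formulation concrete, the following is a minimal sketch in which each sentence is assigned a set of language labels rather than a single one. The label inventory `LANGS`, the toy gold and predicted labels, and the helper names are all hypothetical and chosen for illustration. The two scores shown (exact match ratio and Hamming loss) are standard multi-label metrics and are not necessarily the metrics defined in this work.

```python
from typing import List, Set

# Hypothetical label inventory; a realistic setting would cover many more languages.
LANGS = ["en", "es", "hi"]


def to_indicator(labels: Set[str], langs: List[str] = LANGS) -> List[int]:
    """Encode a sentence's set of languages as a binary indicator vector."""
    return [1 if lang in labels else 0 for lang in langs]


def exact_match_ratio(gold: List[Set[str]], pred: List[Set[str]]) -> float:
    """Fraction of sentences whose predicted label set equals the gold set."""
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


def hamming_loss(gold: List[Set[str]], pred: List[Set[str]],
                 langs: List[str] = LANGS) -> float:
    """Fraction of (sentence, language) decisions that are wrong.

    Assumes every label in gold and pred belongs to `langs`.
    """
    wrong = sum(len(g ^ p) for g, p in zip(gold, pred))
    return wrong / (len(gold) * len(langs))


# Toy data (invented): a code-switched sentence carries more than one label.
gold = [{"en"}, {"en", "es"}, {"hi", "en"}, {"es"}]
pred = [{"en"}, {"en"}, {"hi", "en"}, {"es", "en"}]

if __name__ == "__main__":
    print(to_indicator({"en", "es"}))
    print(exact_match_ratio(gold, pred))
    print(hamming_loss(gold, pred))
```

In this toy example, a prediction that misses one language of a code-switched sentence counts as fully wrong under exact match but as only one wrong decision under Hamming loss, which illustrates why the choice of metric matters for this task.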