The mixing of two or more languages is called Code-Mixing (CM), and it is a social norm in multilingual societies. Neural Language Models (NLMs) such as transformers have been very effective on many NLP tasks, but NLMs for CM remain under-explored. Although transformers are capable and powerful, they do not inherently encode positional/sequential information because they are non-recurrent. Positional encodings are therefore defined to enrich word information with positional information. We hypothesize that Switching Points (SPs), i.e., junctions in the text where the language switches (L1 -> L2 or L2 -> L1), pose a challenge for CM Language Models (LMs), and we therefore give special emphasis to switching points in the modeling process. We experiment with several positional encoding mechanisms and show that rotary positional encodings combined with switching point information yield the best results. We introduce CONFLATOR, a neural language modeling approach for code-mixed languages. CONFLATOR aims to learn to emphasize switching points using switching-point-aware positional encoding, at both the unigram and bigram levels. CONFLATOR outperforms the state of the art on two tasks based on code-mixed Hindi and English (Hinglish): (i) sentiment analysis and (ii) machine translation.
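To make the two ingredients named above concrete, the following is a minimal sketch, not the CONFLATOR implementation. It assumes that each token already carries a language tag (the tags `"HI"` and `"EN"` and the example sentence are hypothetical), and it uses the standard rotary scheme, in which consecutive pairs of embedding dimensions are rotated by position-dependent angles. The function names `switching_points` and `rotary_encode`, and the `base` parameter, are illustrative assumptions. How CONFLATOR combines switching point information with the positional encoding is not specified in this abstract, so the sketch does not attempt it.

```python
import math


def switching_points(lang_tags):
    """Return indices i where token i is in a different language than token i-1.

    Covers both directions of a switch (L1 -> L2 and L2 -> L1).
    """
    return [i for i in range(1, len(lang_tags)) if lang_tags[i] != lang_tags[i - 1]]


def rotary_encode(vec, position, base=10000.0):
    """Apply a standard rotary positional encoding to one embedding vector.

    Each consecutive pair (vec[k], vec[k+1]) is rotated by the angle
    position * base ** (-k / d), where d is the (even) vector length.
    """
    d = len(vec)
    if d % 2 != 0:
        raise ValueError("embedding dimension must be even")
    out = []
    for k in range(0, d, 2):
        theta = position * base ** (-k / d)
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[k], vec[k + 1]
        out.extend([x * c - y * s, x * s + y * c])
    return out


# Hypothetical Hinglish sentence with per-token language tags (assumed given).
tokens = ["mujhe", "yeh", "movie", "bahut", "pasand", "hai"]
tags = ["HI", "HI", "EN", "HI", "HI", "HI"]

sp = switching_points(tags)  # [2, 3]: HI -> EN at "movie", EN -> HI at "bahut"

# Toy 4-dimensional embedding, rotated according to the token's position.
embedding = [1.0, 0.0, 0.5, 0.5]
encoded = rotary_encode(embedding, position=2)
```

Because each pair is only rotated, the encoded vector has the same norm as the input, and the dot product between two encoded vectors depends only on the difference of their positions. That relative-position property is what makes rotary encodings a natural place to inject extra signals such as switching point information.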