Code-switching is the mixing of multiple languages within a text, and it is increasingly common on social media. Code-mixed texts are usually written in a single script, even though the languages involved have different native scripts. Pre-trained multilingual models, however, are trained primarily on data in each language's native script. Existing studies use code-switched texts as they are. Because of this pre-trained knowledge, writing each language in its native script can yield better representations of the text. This study therefore proposes a cross-language-script knowledge sharing architecture that uses cross-attention and alignment between the representations of the text in the individual language scripts. Experimental results on two datasets containing Nepali-English and Hindi-English code-switched texts demonstrate the effectiveness of the proposed method. Interpreting the model with a model explainability technique illustrates the sharing of language-specific knowledge between the language-specific representations.
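The abstract names two mechanisms, cross-attention and alignment between script-specific representations, without giving their exact form. The following is a minimal sketch of how the two could fit together, not the authors' implementation. Everything in it is an assumption made for illustration: the toy vectors stand in for encoder outputs of the same sentence in Roman and native script, the attention has no learned projections, and the alignment objective is taken to be one minus the cosine similarity of mean-pooled representations.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def cross_attention(queries, keys_values):
    """Scaled dot-product attention: each query token attends over the
    token vectors of the other script. No learned projections (assumption)."""
    dim = len(queries[0])
    scale = math.sqrt(dim)
    attended = []
    for q in queries:
        weights = softmax([dot(q, k) / scale for k in keys_values])
        attended.append([
            sum(w * kv[i] for w, kv in zip(weights, keys_values))
            for i in range(dim)
        ])
    return attended

def mean_pool(vectors):
    """Average token vectors into one sentence-level vector."""
    return [sum(v[i] for v in vectors) / len(vectors)
            for i in range(len(vectors[0]))]

def alignment_loss(repr_a, repr_b):
    """Hypothetical alignment objective: 1 - cosine similarity of the
    mean-pooled representations. The paper's actual loss may differ."""
    pa, pb = mean_pool(repr_a), mean_pool(repr_b)
    norm = math.sqrt(dot(pa, pa)) * math.sqrt(dot(pb, pb))
    return 1.0 - dot(pa, pb) / norm

# Toy stand-ins for encoder outputs of one sentence in two scripts.
roman_repr = [[0.1, 0.3, 0.5], [0.2, 0.1, 0.4]]
native_repr = [[0.4, 0.2, 0.1], [0.3, 0.5, 0.2], [0.1, 0.1, 0.6]]

# Knowledge sharing in both directions, then an alignment penalty.
roman_enriched = cross_attention(roman_repr, native_repr)
native_enriched = cross_attention(native_repr, roman_repr)
loss = alignment_loss(roman_enriched, native_enriched)
```

In a trainable model the queries, keys, and values would pass through learned linear projections, and the alignment term would be added to the task loss; both are omitted here to keep the sketch self-contained.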