ConCSE: Unified Contrastive Learning and Augmentation for Code-Switched Embeddings

This paper examines Code-Switching (CS), the phenomenon in which two languages intertwine within a single utterance. There is a clear need for research on CS between English and Korean. We highlight that the Equivalence Constraint (EC) theory, developed for CS in other languages, may only partially capture the complexities of English-Korean CS because of the intrinsic grammatical differences between the two languages. To mitigate this challenge, we introduce Koglish, a novel dataset tailored for English-Korean CS scenarios. First, we construct the Koglish-GLUE dataset to demonstrate the importance of and need for CS datasets in various tasks. We find that various multilingual foundation language models produce different results when trained on a monolingual dataset versus a CS dataset. Motivated by this, we hypothesize that SimCSE, which has shown strengths in monolingual sentence embedding, has limitations in CS scenarios. To verify this, we construct a novel Koglish-NLI (Natural Language Inference) dataset using a CS augmentation-based approach. Building on this CS-augmented Koglish-NLI dataset, we propose ConCSE, a unified contrastive learning and augmentation method for code-switched embeddings that highlights the semantics of CS sentences. Experimental results validate the proposed ConCSE, which achieves an average performance improvement of 1.77% on the Koglish-STS (Semantic Textual Similarity) tasks.
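As a rough illustration of the kind of contrastive objective that SimCSE-style methods build on, the sketch below computes an InfoNCE loss with in-batch negatives over toy embeddings. It is not the ConCSE objective, which the abstract does not specify. The pairing of each English sentence with a code-switched counterpart as its positive, the function names `cosine` and `info_nce_loss`, and the two-dimensional toy vectors are all assumptions made for illustration; the temperature of 0.05 follows the commonly used SimCSE default.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce_loss(anchors, positives, temperature=0.05):
    """Mean InfoNCE loss with in-batch negatives.

    anchors[i] and positives[i] form a positive pair; every other
    positives[j] in the batch acts as a negative for anchors[i].
    """
    losses = []
    for i, anchor in enumerate(anchors):
        logits = [cosine(anchor, p) / temperature for p in positives]
        # Numerically stable log-sum-exp over the batch.
        m = max(logits)
        log_denom = m + math.log(sum(math.exp(x - m) for x in logits))
        losses.append(log_denom - logits[i])
    return sum(losses) / len(losses)


# Hypothetical toy embeddings: English sentences and code-switched variants.
english = [[1.0, 0.0], [0.0, 1.0]]
code_switched_aligned = [[0.9, 0.1], [0.1, 0.9]]   # close to their English pair
code_switched_shuffled = [[0.1, 0.9], [0.9, 0.1]]  # paired with the wrong sentence

aligned_loss = info_nce_loss(english, code_switched_aligned)
shuffled_loss = info_nce_loss(english, code_switched_shuffled)
print(aligned_loss < shuffled_loss)
```

The loss is small when each code-switched embedding sits closest to its own English sentence and large when the pairs are mismatched, which is the property a contrastive sentence-embedding method relies on during training.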

