This paper examines code-switching (CS), the phenomenon in which two languages intertwine within a single utterance. There is a clear need for research on CS between English and Korean. We highlight that the Equivalence Constraint (EC) theory, developed for CS in other languages, may only partially capture the complexities of English-Korean CS because of intrinsic grammatical differences between the two languages. To address this challenge, we introduce Koglish, a novel dataset tailored to English-Korean CS scenarios. First, we construct the Koglish-GLUE dataset to demonstrate the importance of and need for CS datasets across various tasks. We find that various multilingual foundation language models yield different outcomes when trained on a monolingual dataset versus a CS dataset. Motivated by this, we hypothesize that SimCSE, which has shown strengths in monolingual sentence embedding, would have limitations in CS scenarios. To verify this hypothesis, we construct a novel Koglish-NLI (Natural Language Inference) dataset using a CS augmentation-based approach. Building on this CS-augmented Koglish-NLI dataset, we propose ConCSE, a unified contrastive learning and augmentation method for code-switched embeddings that highlights the semantics of CS sentences. Experimental results validate ConCSE, which achieves an average performance improvement of 1.77% on the Koglish-STS (Semantic Textual Similarity) tasks.