Code-switching (CS) is the practice of speakers alternating between two or more languages, and it is becoming increasingly common in the modern world. To better describe CS speech, the Matrix Language Frame (MLF) theory introduces the concept of a Matrix Language (ML), the language that provides the grammatical structure for a CS utterance. In this work, the MLF theory was used to develop systems for Matrix Language Identity (MLID) determination. The MLID of English/Mandarin and English/Spanish CS text and speech was compared to acoustic language identity (LID), the typical way of identifying the language of a monolingual utterance. MLID predictors from audio show a higher correlation with the textual principles than LID in all cases, and they also outperform LID in an MLID recognition task in terms of macro F1 (60%) and correlation score (0.38). This novel approach shows that the non-English languages (Mandarin and Spanish) are preferred over English as the ML, contrary to the monolingual choice made by LID.
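For readers unfamiliar with the macro F1 metric quoted above, the following is a minimal sketch of how it is computed for a label-prediction task such as MLID. The function name `macro_f1`, the label strings, and the toy data are illustrative assumptions; they are not taken from the work's systems or results.

```python
def macro_f1(gold, pred):
    """Unweighted mean of per-class F1 over every label seen in gold or pred."""
    labels = sorted(set(gold) | set(pred))
    scores = []
    for label in labels:
        pairs = list(zip(gold, pred))
        tp = sum(1 for g, p in pairs if g == label and p == label)
        fp = sum(1 for g, p in pairs if g != label and p == label)
        fn = sum(1 for g, p in pairs if g == label and p != label)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never occurs.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Hypothetical ML labels for six CS utterances (made-up data).
gold = ["zh", "zh", "zh", "en", "zh", "en"]
pred = ["zh", "en", "zh", "en", "zh", "zh"]
print(macro_f1(gold, pred))  # 0.625: per-class F1 is 0.75 for "zh", 0.5 for "en"
```

Because each class contributes equally to the average, macro F1 does not reward a predictor that simply outputs the more frequent language, which matters when one language dominates as the ML.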