This paper investigates the relationship between utterance sentiment and language choice in English-Tamil code-switched text, using methods from machine learning and statistical modelling. We apply a fine-tuned XLM-RoBERTa model for token-level language identification to 35,650 romanized YouTube comments from the DravidianCodeMix dataset, producing per-utterance measurements of English proportion and language switch frequency. Linear regression analysis reveals that positive utterances exhibit a significantly greater English proportion (34.3%) than negative utterances (24.8%), and that mixed-sentiment utterances show the highest language switch frequency when controlling for utterance length. These findings support the hypothesis that emotional content influences language choice in multilingual code-switching settings, owing to sociolinguistic associations of prestige and identity with the embedded and matrix languages.
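To make the two per-utterance measurements named in the abstract concrete, the following is a minimal sketch of how English proportion and language switch count could be derived from token-level language labels. The function name `utterance_metrics`, the tag strings `"en"` and `"ta"`, and the choice to count a switch at every change of label between adjacent tokens are all assumptions made for illustration; the abstract does not state the paper's exact definitions, including how tokens outside the two languages are handled or how switch frequency is normalized for utterance length.

```python
from typing import List, Tuple


def utterance_metrics(labels: List[str], english_tag: str = "en") -> Tuple[float, int]:
    """Return (English proportion, number of language switches) for one utterance.

    `labels` holds one language tag per token, as a token-level language
    identifier might produce. The tag names are hypothetical.
    """
    if not labels:
        # An empty utterance has no tokens and no switches.
        return 0.0, 0

    # Share of tokens tagged as English.
    english_proportion = sum(1 for tag in labels if tag == english_tag) / len(labels)

    # A switch is counted wherever two adjacent tokens carry different tags
    # (an assumed definition, not necessarily the paper's).
    switches = sum(1 for prev, cur in zip(labels, labels[1:]) if prev != cur)

    return english_proportion, switches


# Hypothetical five-token utterance: Tamil, Tamil, English, English, Tamil.
proportion, switches = utterance_metrics(["ta", "ta", "en", "en", "ta"])
print(proportion, switches)
```

Measurements of this kind, computed per utterance, could then serve as dependent variables in a regression on sentiment label, with utterance length as a covariate as the abstract describes.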