Machine learning models often fail under distribution shifts, a problem exacerbated in low-resource settings where limited data restricts robust generalization. Domain generalization (DG) methods address this challenge by learning representations that remain invariant across domains, frequently leveraging causal principles. In this work, we study two causal DG approaches for low-resource natural language processing. First, we apply causal data augmentation, using GPT-4o-mini to generate counterfactual paraphrases for sentiment classification on the NaijaSenti Twitter corpus in Yoruba and Igbo. Second, we investigate invariant causal representation learning with the Debiasing in Aspect Review (DINER) framework for aspect-based sentiment analysis. We extend DINER to a multilingual setting by introducing Afri-SemEval, a dataset covering 17 languages translated from SemEval-2014 Task 4. Experiments show improved robustness to unseen domains: counterfactual augmentation yields consistent gains, and causal representation learning improves out-of-distribution performance across multiple languages.
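As a concrete illustration of the first approach, the sketch below shows one way to generate label-flipped counterfactual paraphrases with GPT-4o-mini through the standard OpenAI Python client. The prompt wording and the `counterfactual` helper are illustrative assumptions, not the exact pipeline used in our experiments.

```python
# pip install openai
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

PROMPT = (
    "Rewrite the following tweet so that its sentiment flips from {src} to {tgt}, "
    "changing only sentiment-bearing words and keeping everything else intact.\n"
    "Tweet: {text}"
)

def counterfactual(text: str, src_label: str, tgt_label: str) -> str:
    """Ask GPT-4o-mini for a minimally edited, label-flipped paraphrase."""
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{
            "role": "user",
            "content": PROMPT.format(src=src_label, tgt=tgt_label, text=text),
        }],
        temperature=0.7,
    )
    return resp.choices[0].message.content.strip()

# e.g. counterfactual(tweet, "negative", "positive") yields a new
# (paraphrase, "positive") pair appended to the augmented training set.
```

Because only sentiment-bearing words change, each generated pair isolates the causal feature (sentiment) while holding spurious surface features of the tweet fixed.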
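For the second approach, invariant causal representation learning is commonly operationalized with an invariance penalty computed per training environment. The PyTorch sketch below uses the generic IRMv1 penalty of Arjovsky et al. as a stand-in for this family of objectives; it is not DINER's actual debiasing objective, which is not reproduced here.

```python
import torch
import torch.nn.functional as F

def irmv1_penalty(logits: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    """IRMv1 penalty: squared gradient of the per-environment risk with
    respect to a fixed dummy classifier scale (w = 1.0)."""
    scale = torch.tensor(1.0, requires_grad=True, device=logits.device)
    loss = F.cross_entropy(logits * scale, labels)
    (grad,) = torch.autograd.grad(loss, [scale], create_graph=True)
    return grad.pow(2).sum()

def total_loss(model, envs, lam=1.0):
    """Empirical risk plus invariance penalty, summed over environments
    (e.g. one environment per language or per review domain)."""
    erm, penalty = 0.0, 0.0
    for x, y in envs:  # envs: iterable of (inputs, labels) per environment
        logits = model(x)
        erm = erm + F.cross_entropy(logits, y)
        penalty = penalty + irmv1_penalty(logits, y)
    return erm + lam * penalty
```

In the multilingual setting, each language in Afri-SemEval can serve as one environment, so the penalty steers the encoder toward features whose optimal classifier is shared across languages rather than language-specific shortcuts.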