People worldwide use language in subtle and complex ways to express emotions. While emotion recognition, an umbrella term for several NLP tasks, significantly impacts different applications in NLP and other fields, most work in the area is focused on high-resource languages. This has led to major disparities in research and proposed solutions, especially for low-resource languages that suffer from a lack of high-quality datasets. In this paper, we present BRIGHTER, a collection of multi-label emotion-annotated datasets in 28 different languages. BRIGHTER covers predominantly low-resource languages from Africa, Asia, Eastern Europe, and Latin America, with instances from various domains annotated by fluent speakers. We describe the data collection and annotation processes and the challenges of building these datasets. We then report experimental results for monolingual and crosslingual multi-label emotion identification, as well as intensity-level emotion recognition. We investigate results with and without using LLMs and analyse the large variability in performance across languages and text domains. We show that BRIGHTER datasets are a step towards bridging the gap in text-based emotion recognition and discuss their impact and utility.