BRIGHTER: BRIdging the Gap in Human-Annotated Textual Emotion Recognition Datasets for 28 Languages

Shamsuddeen Hassan Muhammad, Nedjma Ousidhoum, Idris Abdulmumin, Jan Philip Wahle, Terry Ruas, Meriem Beloucif, Christine de Kock, Nirmal Surange, Daniela Teodorescu, Ibrahim Said Ahmad, David Ifeoluwa Adelani, Alham Fikri Aji, Felermino D. M. A. Ali, Ilseyar Alimova, Vladimir Araujo, Nikolay Babakov, Naomi Baes, Ana-Maria Bucur, Andiswa Bukula, Guanqun Cao, Rodrigo Tufino Cardenas, Rendi Chevi, Chiamaka Ijeoma Chukwuneke, Alexandra Ciobotaru, Daryna Dementieva, Murja Sani Gadanya, Robert Geislinger, Bela Gipp, Oumaima Hourrane, Oana Ignat, Falalu Ibrahim Lawan, Rooweither Mabuya, Rahmad Mahendra, Vukosi Marivate, Andrew Piper, Alexander Panchenko, Charles Henrique Porto Ferreira, Vitaly Protasov, Samuel Rutunda, Manish Shrivastava, Aura Cristina Udrea, Lilian Diana Awuor Wanzare, Sophie Wu, Florian Valentin Wunderlich, Hanif Muhammad Zhafran, Tianhui Zhang, Yi Zhou, Saif M. Mohammad

From arXiv, 20 pages, under review

People worldwide use language in subtle and complex ways to express emotions. While emotion recognition, an umbrella term for several NLP tasks, significantly impacts different applications in NLP and other fields, most work in the area focuses on high-resource languages. This has led to major disparities in research and proposed solutions, especially for low-resource languages that lack high-quality datasets. In this paper, we present BRIGHTER, a collection of multi-labeled emotion-annotated datasets in 28 different languages. BRIGHTER covers predominantly low-resource languages from Africa, Asia, Eastern Europe, and Latin America, with instances from various domains annotated by fluent speakers. We describe the data collection and annotation processes and the challenges of building these datasets. Then, we report different experimental results for monolingual and crosslingual multi-label emotion identification, as well as intensity-level emotion recognition. We investigate results with and without using LLMs and analyse the large variability in performance across languages and text domains. We show that BRIGHTER datasets are a step towards bridging the gap in text-based emotion recognition and discuss their impact and utility.
