In the field of emotion analysis, much NLP research focuses on identifying a limited set of discrete emotion categories, often applied across languages. These basic category sets, however, are rarely designed with textual data in mind, and culture, language, and dialect all influence how particular emotions are interpreted. In this work, we broaden our scope to a practically unbounded set of \textit{affective states}: any terms that humans use to describe their experiences of feeling. We collect and publish MASIVE, a dataset of Reddit posts in English and Spanish containing over 1,000 unique affective states in each language. We then define \textit{affective state identification}, a new task for language generation models framed as masked span prediction. On this task, we find that smaller finetuned multilingual models outperform much larger LLMs, even on region-specific Spanish affective states. Additionally, we show that pretraining on MASIVE improves model performance on existing emotion benchmarks. Finally, through machine translation experiments, we find that data written by native speakers is vital to good performance on this task.