In the field of emotion analysis, much NLP research focuses on identifying a limited number of discrete emotion categories, often applied across languages. These basic sets, however, are rarely designed with textual data in mind, and culture, language, and dialect can influence how particular emotions are interpreted. In this work, we broaden our scope to a practically unbounded set of \textit{affective states}, which includes any terms that humans use to describe their experiences of feeling. We collect and publish MASIVE, a dataset of Reddit posts in English and Spanish containing over 1,000 unique affective states each. We then define the new task of \textit{affective state identification}, framed as masked span prediction for language generation models. On this task, we find that smaller finetuned multilingual models outperform much larger LLMs, even on region-specific Spanish affective states. Additionally, we show that pretraining on MASIVE improves model performance on existing emotion benchmarks. Finally, through machine translation experiments, we find that native speaker-written data is vital to good performance on this task.