The widespread adoption of automatic sentiment and emotion classifiers makes it important to ensure that these tools perform reliably across different populations. Yet their reliability is typically assessed using benchmarks that rely on third-party annotators rather than the individuals experiencing the emotions themselves, potentially concealing systematic biases. In this paper, we use a unique, large-scale dataset of more than one million self-annotated posts and a pre-registered research design to investigate gender biases in emotion detection across 414 combinations of models and emotion-related classes. We find that across different types of automatic classifiers and various underlying emotions, error rates are consistently higher for texts authored by men compared to those authored by women. We quantify how this bias could affect results in downstream applications and show that current machine learning tools, including large language models, should be applied with caution when the gender composition of a sample is not known or variable. Our findings demonstrate that sentiment analysis is not yet a solved problem, especially in ensuring equitable model behaviour across demographic groups.