Understanding and identifying the causes behind developers' emotions (e.g., Frustration caused by "delays in merging pull requests") can be crucial to finding solutions to problems and fostering collaboration in open-source communities. Effectively identifying such information in the high volume of communications across the different project channels, such as chats, emails, and issue comments, requires automated recognition of emotions and their causes. Enabling this automation requires large-scale, software engineering-specific datasets that can be used to train accurate machine learning models. However, such datasets are expensive to create given the variety and informal nature of software projects' communication channels. In this paper, we explore zero-shot LLMs, which are pre-trained on massive datasets but not fine-tuned specifically for the task of detecting emotion causes in software engineering: ChatGPT, GPT-4, and flan-alpaca. Our evaluation indicates that these recently available models can identify emotion categories when given detailed emotions, although they perform worse than the top-rated models. For emotion cause identification, our results indicate that zero-shot LLMs are effective at recognizing the correct emotion cause, with a BLEU-2 score of 0.598. To highlight the potential use of these techniques, we conduct a case study of the causes of Frustration in the last year of development of a popular open-source project, revealing several interesting insights.