Revisiting Sentiment Analysis for Software Engineering in the Era of Large Language Models

Software development is an inherently collaborative process in which stakeholders frequently express their opinions and emotions across diverse platforms. Recognizing the sentiments conveyed in these interactions is crucial for the effective development and ongoing maintenance of software systems. For instance, app developers can apply sentiment analysis to user reviews to improve the quality of their apps. Over the years, many tools have been proposed to aid sentiment analysis, but accurately identifying the sentiments expressed in software engineering datasets remains challenging. Recent advances have shown the potential of fine-tuned pre-trained language models on software engineering datasets, although these models are constrained by the shortage of labeled data. With the emergence of large language models (LLMs), it is pertinent to investigate how these models perform in sentiment analysis for software engineering. In this work, we undertake a comprehensive empirical study using five established software engineering datasets. We assess the performance of three open-source LLMs in both zero-shot and few-shot scenarios. Additionally, we compare fine-tuned smaller pre-trained language models with prompted LLMs. Our experimental findings demonstrate that LLMs exhibit state-of-the-art performance on datasets marked by limited training data and imbalanced distributions. LLMs can also achieve excellent performance in a zero-shot setting. However, when ample training data is available, or when the dataset exhibits a more balanced distribution, fine-tuned smaller language models can still achieve superior results.
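To make the distinction between the zero-shot and few-shot settings concrete, the following is a minimal sketch of how such prompts might be assembled. It is not the prompt template used in the study: the three-way label set, the instruction wording, and the helper names `build_prompt` and `parse_label` are all assumptions introduced for illustration, and no model is actually called.

```python
from typing import Sequence, Tuple

# Assumed label set; the datasets in the study may use different labels.
LABELS = ("positive", "neutral", "negative")

# Hypothetical instruction wording, not taken from the paper.
INSTRUCTION = (
    "Classify the sentiment of the sentence as positive, neutral, or negative."
)


def build_prompt(text: str, examples: Sequence[Tuple[str, str]] = ()) -> str:
    """Build a zero-shot prompt (no examples) or a few-shot prompt.

    Each example is a (sentence, label) pair shown to the model before
    the sentence that is to be classified.
    """
    parts = [INSTRUCTION]
    for sentence, label in examples:
        if label not in LABELS:
            raise ValueError(f"unknown label: {label!r}")
        parts.append(f"Sentence: {sentence}\nSentiment: {label}")
    # The final block leaves the label blank for the model to complete.
    parts.append(f"Sentence: {text}\nSentiment:")
    return "\n\n".join(parts)


def parse_label(output: str, default: str = "neutral") -> str:
    """Map free-form model output to the earliest label it mentions."""
    lowered = output.lower()
    hits = [(lowered.find(label), label) for label in LABELS if label in lowered]
    return min(hits)[1] if hits else default


if __name__ == "__main__":
    # Zero-shot: instruction plus the target sentence only.
    print(build_prompt("This API is a nightmare to use."))
    # Few-shot: the same prompt with one labeled demonstration prepended.
    print(build_prompt("Thanks, the fix works!", [("I hate this bug.", "negative")]))
```

In this sketch the only difference between the two settings is whether labeled demonstrations precede the target sentence. The fallback to a default label in `parse_label` is a design choice for handling output that names no label, and a real evaluation might instead count such outputs as errors.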
