Software development is an inherently collaborative process in which stakeholders frequently express opinions and emotions across diverse platforms. Recognizing the sentiments conveyed in these interactions is crucial for the effective development and ongoing maintenance of software systems. Many tools have been proposed for sentiment analysis over the years, but accurately identifying the sentiments expressed in software engineering datasets remains challenging. Although fine-tuned smaller large language models (sLLMs) have shown potential on software engineering tasks, they struggle when labeled data is scarce. With the emergence of bigger large language models (bLLMs), it is pertinent to investigate whether they can address this challenge in sentiment analysis for software engineering. In this work, we conduct a comprehensive empirical study on five established datasets. We assess three open-source bLLMs in both zero-shot and few-shot scenarios and compare them with fine-tuned sLLMs. Our experimental findings show that bLLMs achieve state-of-the-art performance on datasets with limited training data and imbalanced distributions, and that they also perform strongly in a zero-shot setting. However, when ample training data is available or the dataset is more balanced, fine-tuned sLLMs can still achieve superior results.