Figurative language, such as metaphors and idioms, is common in everyday communication, and it also appears in Software Engineering (SE) channels, such as comments on GitHub. Automatically interpreting figurative language is challenging, even for modern Large Language Models (LLMs), because it often involves subtle nuances. This is particularly true in the SE domain, where figurative language is frequently used to convey technical concepts, often bearing developer affect (e.g., 'spaghetti code'). Surprisingly, there is a lack of studies on how figurative language in SE communications affects the performance of automatic tools that focus on understanding developer communications, e.g., tools for bug prioritization and incivility detection. Furthermore, it is an open question to what extent state-of-the-art LLMs can interpret figurative expressions in domain-specific communication such as that of software engineering. To address this gap, we study the prevalence and impact of figurative language in SE communication channels. This study contributes to understanding the role of figurative language in SE, the potential of LLMs to interpret it, and the impact of figurative language on automated SE communication analysis. Our results demonstrate the effectiveness of fine-tuning LLMs with figurative language in SE and its potential impact on automated tasks that involve affect. We found that, among three state-of-the-art LLMs, the best fine-tuned versions achieve an average improvement of 6.66% on a GitHub emotion classification dataset, 7.07% on a GitHub incivility classification dataset, and 3.71% on a Bugzilla bug report prioritization dataset.