Figurative language, such as metaphors and idioms, is common in everyday communication, and it also appears in Software Engineering (SE) channels, such as comments on GitHub. Automatically interpreting figurative language is challenging, even for modern Large Language Models (LLMs), because it often involves subtle nuances. This is particularly true in the SE domain, where figurative language is frequently used to convey technical concepts, often carrying developer affect (e.g., "spaghetti code"). Surprisingly, few studies have examined how figurative language in SE communications affects the performance of automatic tools that focus on understanding developer communications, e.g., bug prioritization and incivility detection. Furthermore, it remains an open question to what extent state-of-the-art LLMs can interpret figurative expressions in domain-specific communication such as software engineering. To address this gap, we study the prevalence and impact of figurative language in SE communication channels. This study contributes to understanding the role of figurative language in SE, the potential of LLMs to interpret it, and its impact on automated SE communication analysis. Our results demonstrate the effectiveness of fine-tuning LLMs on figurative language in SE and its potential impact on automated tasks that involve affect. Across three state-of-the-art LLMs, the best fine-tuned versions achieve an average improvement of 6.66% on a GitHub emotion classification dataset, 7.07% on a GitHub incivility classification dataset, and 3.71% on a Bugzilla bug report prioritization dataset.