Large Language Models (LLMs) have demonstrated remarkable capabilities in code understanding and generation. However, their effectiveness on non-code Software Engineering (SE) tasks remains underexplored. We present 'Software Engineering Language Understanding' (SELU), the first comprehensive benchmark for evaluating LLMs on 22 Natural Language Understanding (NLU) tasks over SE textual artifacts, ranging from identifying whether a requirement is functional or non-functional to estimating the effort required to implement a development task. SELU covers classification, regression, Named Entity Recognition (NER), and Masked Language Modeling (MLM) tasks, with data drawn from diverse sources such as issue tracking systems and developer forums. We fine-tune 22 open-source LLMs, both generalist and domain-adapted, and prompt two proprietary alternatives using zero-shot and 3-shot prompting strategies. Performance is measured with metrics such as F1-macro, SMAPE, F1-micro, and accuracy, and compared via the Bayesian signed-rank test. Our results show that fine-tuned models across various sizes and architectures perform best, exhibiting both high mean performance and low across-task variance. Furthermore, domain adaptation via code-focused pre-training does not yield significant improvements and may even be counterproductive for developer communication tasks.
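As a concrete illustration of the evaluation metrics named above, the sketch below shows how F1-macro (for a classification task such as requirement type prediction) and SMAPE (for a regression task such as effort estimation) could be computed. This is a minimal sketch, not the authors' evaluation code: the `smape` helper, label sets, and numeric values are hypothetical, and the paper may use slightly different metric conventions.

```python
# Illustrative sketch of two metrics mentioned in the abstract (not the paper's code).
import numpy as np
from sklearn.metrics import f1_score

def smape(y_true, y_pred, eps=1e-8):
    """Symmetric Mean Absolute Percentage Error, in percent (0 = perfect fit)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    denom = (np.abs(y_true) + np.abs(y_pred)) / 2.0 + eps
    return 100.0 * np.mean(np.abs(y_pred - y_true) / denom)

# Classification example: functional vs. non-functional requirements (hypothetical labels).
y_true_cls = ["functional", "non-functional", "functional", "non-functional"]
y_pred_cls = ["functional", "functional", "functional", "non-functional"]
print("F1-macro:", f1_score(y_true_cls, y_pred_cls, average="macro"))

# Regression example: effort estimation in story points (hypothetical values).
y_true_reg = [3, 5, 8, 13]
y_pred_reg = [2, 5, 10, 8]
print("SMAPE:", smape(y_true_reg, y_pred_reg))
```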