Evaluation of large language models for assessing code maintainability

Increased availability of open-source software repositories and recent advances in code analysis using large language models (LLMs) have triggered a wave of new work to automate software engineering tasks that were previously very difficult to automate. In this paper, we investigate a recent line of work that hypothesises that comparing the probability of code generated by LLMs with the probability the current code would have had can indicate potential quality problems. We investigate the association between the cross-entropy of code generated by ten different models (based on GPT2 and Llama2) and the following quality aspects: readability, understandability, complexity, modularisation, and overall maintainability, as assessed by experts and available in a benchmark dataset. Our results show that, when controlling for the number of logical lines of code (LLOC), cross-entropy computed by LLMs is indeed a predictor of maintainability at the class level (the higher the cross-entropy, the lower the maintainability). However, this relation is reversed when one does not control for LLOC (e.g., when comparing small classes with longer ones). Furthermore, while the complexity of LLMs affects the range of cross-entropy (smaller models tend to have a wider range of cross-entropy), it nevertheless plays a significant role in predicting maintainability aspects. Our study is limited to ten different pretrained models (based on GPT2 and Llama2) and to the maintainability aspects collected by Schnappinger et al. When controlling for LLOC, cross-entropy is a predictor of maintainability. However, while related work has shown the potential usefulness of cross-entropy at the level of tokens or short sequences, at the class level this criterion alone may prove insufficient to predict maintainability, and further research is needed to make the best use of this information in practice.
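For readers unfamiliar with the measure, the following is a minimal sketch of how a cross-entropy value could be derived from the probabilities a language model assigns to each token of a piece of code. It assumes the common definition of cross-entropy as the mean negative log-probability per token (in nats); the function name `cross_entropy` and the probability values are illustrative assumptions and are not taken from the study or its models.

```python
import math


def cross_entropy(token_probs):
    """Mean negative log-probability (in nats) over a token sequence.

    token_probs: probabilities in (0, 1] that a model assigned to each
    token that actually occurs in the code, given the preceding tokens.
    """
    if not token_probs:
        raise ValueError("need at least one token probability")
    return -sum(math.log(p) for p in token_probs) / len(token_probs)


# Hypothetical per-token probabilities for two code fragments
# (made-up values for illustration only).
predictable = [0.9, 0.8, 0.95, 0.85]  # tokens the model expected
surprising = [0.2, 0.1, 0.4, 0.05]    # tokens the model found unlikely

print(cross_entropy(predictable))
print(cross_entropy(surprising))
```

Under this definition, code whose tokens the model finds less likely yields a higher cross-entropy. Because the value is a per-token average, it does not by itself account for class size, which is one reason size (here LLOC) has to be handled separately in any analysis.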
