PELLI: Framework to effectively integrate LLMs for quality software generation

Recent studies have shown that LLMs, when appropriately prompted and configured, produce mixed results that often meet or exceed baseline performance. However, these comparisons have two primary issues. First, they mostly considered only reliability as a comparison metric. Second, they selected only a few LLMs (such as Codex and ChatGPT) for comparison. This paper proposes a comprehensive code quality assessment framework called Programmatic Excellence via LLM Iteration (PELLI). PELLI is an iterative, analysis-based process that upholds high-quality code changes. We extend the state of the art by performing a comprehensive evaluation that generates quantitative metrics for three primary nonfunctional requirements (maintainability, performance, and reliability) across five popular LLMs. To demonstrate PELLI's applicability, we selected three application domains and followed Python coding standards. By following this framework, practitioners can ensure harmonious integration between LLMs and human developers, so that the potential of LLMs is fully realized. PELLI can serve as a practical guide for developers aiming to leverage LLMs while adhering to recognized quality standards. This study's outcomes are crucial for advancing LLM technologies in real-world applications, providing stakeholders with a clear understanding of where these LLMs excel and where they require further refinement. Overall, based on the three nonfunctional requirements, we found that GPT-4T and Gemini performed slightly better than the other evaluated LLMs. We also found that prompt design can influence overall code quality. In addition, each application domain showed both high and low scores across metrics, and even within the same metric across different prompts.
