This paper presents a comprehensive comparative analysis of Large Language Models (LLMs) for code documentation generation. Code documentation is an essential part of the software development process. The paper evaluates the models GPT-3.5, GPT-4, Bard, Llama 2, and StarChat on parameters such as Accuracy, Completeness, Relevance, Understandability, Readability, and Time Taken, across different levels of code documentation. Our evaluation employs a checklist-based system to minimize subjectivity, providing a more objective assessment. We find that, barring StarChat, all LLMs consistently outperform the original documentation. Notably, the closed-source models GPT-3.5, GPT-4, and Bard exhibit superior performance across these parameters compared to the open-source/source-available LLMs, namely Llama 2 and StarChat. In terms of generation time, GPT-4 took the longest, followed by Llama 2 and then Bard, with GPT-3.5 and StarChat having comparable generation times. Additionally, file-level documentation performed considerably worse across all parameters (except time taken) than inline and function-level documentation.