This paper presents a comprehensive comparative analysis of Large Language Models (LLMs) for generating code documentation. Code documentation is an essential part of the software development process. The paper evaluates GPT-3.5, GPT-4, Bard, Llama 2, and StarChat on accuracy, completeness, relevance, understandability, readability, and time taken, across different levels of code documentation. Our evaluation employs a checklist-based system to minimize subjectivity and provide a more objective assessment. We find that, barring StarChat, all LLMs consistently outperform the original documentation. Notably, the closed-source models GPT-3.5, GPT-4, and Bard exhibit superior performance across the evaluated parameters compared to the open-source/source-available LLMs, namely Llama 2 and StarChat. In terms of generation time, GPT-4 took the longest, followed by Llama 2 and then Bard, while ChatGPT and StarChat had comparable generation times. Additionally, file-level documentation performed considerably worse on all parameters except time taken than inline and function-level documentation.