Question answering over source code provides software engineers and project managers with helpful information about the implemented features of a software product. This paper presents work on using large language models for question answering over Python source code. The proposed method for building a source code question answering system fine-tunes a large language model on a unified dataset of questions and answers about Python code. To achieve the highest answer quality, we evaluated models trained on datasets preprocessed in different ways: a dataset without grammar correction, a dataset with grammar correction, and a dataset augmented with generated code summaries. The model answers were also analyzed manually for errors. We report BLEU-4, BERTScore F1, BLEURT, and Exact Match scores, along with conclusions from the manual error analysis. The experimental results highlight current problems of the research area, such as the poor quality of publicly available genuine question-answering datasets. The findings also include a positive effect of grammar correction of the training data on the test metric values. These findings and issues may be important for other researchers aiming to improve the quality of source code question answering solutions. The training and evaluation code is publicly available at https://github.com/IU-AES-AI4Code/CodeQuestionAnswering.
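Two of the reported metrics, Exact Match and BLEU-4, can be illustrated with a minimal standard-library sketch. This is not the paper's evaluation code (which the repository above contains); it is a simplified sentence-level BLEU-4 with add-one smoothing and a brevity penalty, assuming whitespace tokenization:

```python
import math
from collections import Counter

def exact_match(pred: str, ref: str) -> int:
    # 1 if the normalized prediction equals the reference, else 0.
    return int(pred.strip().lower() == ref.strip().lower())

def _ngrams(tokens, n):
    # Multiset of n-grams of the token sequence.
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu4(pred: str, ref: str) -> float:
    # Sentence-level BLEU-4: geometric mean of 1..4-gram precisions
    # times a brevity penalty. Add-one smoothing on each precision
    # (a common smoothing variant) keeps the score non-zero when a
    # higher-order n-gram has no overlap.
    p_tok, r_tok = pred.split(), ref.split()
    log_prec = 0.0
    for n in range(1, 5):
        p_ng, r_ng = _ngrams(p_tok, n), _ngrams(r_tok, n)
        overlap = sum((p_ng & r_ng).values())   # clipped n-gram matches
        total = sum(p_ng.values())
        log_prec += 0.25 * math.log((overlap + 1) / (total + 1))
    # Brevity penalty: penalize predictions shorter than the reference.
    if len(p_tok) >= len(r_tok):
        bp = 1.0
    else:
        bp = math.exp(1 - len(r_tok) / max(len(p_tok), 1))
    return bp * math.exp(log_prec)
```

Production evaluations typically rely on established implementations (e.g. sacrebleu for BLEU) rather than hand-rolled scoring, since tokenization and smoothing choices change the absolute values.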