Large language models (LLMs) have become significantly better at code generation. However, there is still a gap between LLMs being capable coders and being top-tier software engineers. Based on the observation that top-level software engineers often ask clarifying questions to reduce ambiguity in both requirements and coding solutions, we argue that LLMs should do the same in code generation tasks. In this work, we conducted an empirical study that benchmarks and analyzes the communication skills of LLMs for code generation. We define the communication skills of LLMs as ``being able to ask clarifying questions when the description of the code generation problem has issues''. We created a new benchmark, HumanEvalComm, by modifying problem descriptions according to three issues: inconsistency, ambiguity, and incompleteness. We defined new evaluation metrics such as Communication Rate and Good Question Rate, and then experimented on HumanEvalComm with different Code LLMs and with a new LLM agent approach, Okanagan, which identifies ambiguous parts of the code and descriptions and asks questions about them in order to further refine the generated code. Finally, we discussed the evaluation results and our findings from comparing Code LLMs with Okanagan.