Large Language Models (LLMs) have shown impressive abilities in code generation, but they may produce erroneous programs. Since reading a program takes ten times longer than writing one, showing these erroneous programs to developers wastes their effort and introduces security risks into software. To address this problem, we propose HonestCoder, a novel LLM-based code generation approach. HonestCoder selectively shows generated programs to developers based on the LLM's confidence, which provides valuable insight into the correctness of the generated programs. To this end, we propose a novel approach to estimating an LLM's confidence in code generation: it measures the multi-modal similarity among LLM-generated programs. We also collect and release TruthCodeBench, a multilingual benchmark consisting of 2,265 samples that covers two popular programming languages (i.e., Python and Java). We apply HonestCoder to four popular LLMs (e.g., DeepSeek-Coder and Code Llama) and evaluate it on TruthCodeBench. The experiments yield the following insights. (1) HonestCoder effectively estimates LLMs' confidence and accurately determines the correctness of generated programs; for example, it outperforms the state-of-the-art baseline by 27.79% in AUROC and 63.74% in AUCPR. (2) HonestCoder decreases the number of erroneous programs shown to developers; compared with eight baselines, it shows more correct programs and fewer erroneous programs. (3) Compared to showing code indiscriminately, HonestCoder adds only slight time overhead (approximately 0.4 seconds per requirement). (4) We discuss future directions to facilitate the application of LLMs in software development. We hope this work can motivate broad discussions about measuring the reliability of LLMs' outputs in code-related tasks.
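To make the confidence-estimation idea concrete, the following minimal sketch illustrates one plausible reading of similarity-based confidence: sample several programs from the LLM for the same requirement, average their pairwise similarity, and show the code only when agreement is high. This is not the paper's implementation; HonestCoder measures multi-modal similarity, whereas this sketch uses a single lexical modality (`difflib.SequenceMatcher`), and the function names and the threshold value are illustrative assumptions.

```python
from difflib import SequenceMatcher
from itertools import combinations

def lexical_similarity(a: str, b: str) -> float:
    """One similarity modality (lexical only, as a stand-in).

    The actual approach combines multiple modalities when comparing
    LLM-generated programs.
    """
    return SequenceMatcher(None, a, b).ratio()

def estimate_confidence(programs: list[str]) -> float:
    """Average pairwise similarity across sampled programs.

    Intuition: if independent samples for the same requirement largely
    agree, the model is likely confident in (one) solution; divergent
    samples suggest uncertainty.
    """
    if len(programs) < 2:
        return 1.0
    pairs = list(combinations(programs, 2))
    return sum(lexical_similarity(a, b) for a, b in pairs) / len(pairs)

def should_show(programs: list[str], threshold: float = 0.8) -> bool:
    # Hypothetical gating rule: only surface code above the threshold;
    # low-confidence generations are withheld from the developer.
    return estimate_confidence(programs) >= threshold
```

As a usage example, three identical samples yield a confidence of 1.0 and pass the gate, while structurally unrelated samples yield a much lower score and are withheld.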