Large Language Models (LLMs) promise to streamline software code reviews, but their ability to produce consistent assessments remains an open question. In this study, we tested four leading LLMs -- GPT-4o mini, GPT-4o, Claude 3.5 Sonnet, and LLaMA 3.2 90B Vision -- on 70 Java commits from both private and public repositories. By setting each model's temperature to zero, clearing context between runs, and repeating the exact same prompts five times, we measured how consistently each model generated code-review assessments. Our results reveal that even with temperature minimized, LLM responses varied to different degrees across runs. These findings highlight the inherently limited consistency (test-retest reliability) of LLMs, even at temperature zero, and the need for caution when using LLM-generated code reviews to make real-world decisions.
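To make the measurement protocol concrete, the sketch below shows one way such a repeated-query loop could be implemented, assuming the OpenAI Python SDK; the model name, prompt template, and `collect_reviews` helper are illustrative placeholders rather than the study's actual harness.

```python
# Minimal sketch of the repeated-query protocol, assuming the OpenAI
# Python SDK (pip install openai). Model name, prompt template, and
# helper are hypothetical, not the authors' actual experimental code.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

REVIEW_PROMPT = "Review the following Java commit and assess its quality:\n{diff}"
N_RUNS = 5  # the exact same prompt is repeated five times

def collect_reviews(diff: str, model: str = "gpt-4o-mini") -> list[str]:
    """Send the identical prompt N_RUNS times with temperature 0.

    Each call is a fresh, single-turn request, so no conversation
    context carries over between repetitions (i.e., context is cleared).
    """
    responses = []
    for _ in range(N_RUNS):
        completion = client.chat.completions.create(
            model=model,
            temperature=0,  # greedy decoding requested; determinism is not guaranteed
            messages=[{"role": "user", "content": REVIEW_PROMPT.format(diff=diff)}],
        )
        responses.append(completion.choices[0].message.content)
    return responses

# Example consistency check: with perfect test-retest reliability,
# all five responses would be character-identical.
# reviews = collect_reviews(some_java_diff)
# is_consistent = len(set(reviews)) == 1
```

Note that temperature 0 only requests (near-)greedy decoding; as the results above indicate, it does not guarantee identical outputs across repeated calls.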