Large Language Models (LLMs) promise to streamline software code reviews, but their ability to produce consistent assessments remains an open question. In this study, we tested four leading LLMs -- GPT-4o mini, GPT-4o, Claude 3.5 Sonnet, and LLaMA 3.2 90B Vision -- on 70 Java commits from both private and public repositories. By setting each model's temperature to zero, clearing context between runs, and repeating the exact same prompts five times, we measured how consistently each model generated code-review assessments. Our results reveal that even with temperature minimized, LLM responses varied to different degrees across runs. These findings highlight the inherently limited consistency (test-retest reliability) of LLMs, even at temperature zero, and the need for caution when using LLM-generated code reviews to make real-world decisions.
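To make the measurement protocol concrete, the sketch below shows one way such a repeated-query loop could be implemented, assuming the OpenAI Python SDK; the model name, prompt template, and `collect_reviews` helper are illustrative placeholders rather than the study's actual harness.

```python
# Minimal sketch of the repeated-query protocol, assuming the OpenAI
# Python SDK (pip install openai). Model name, prompt template, and
# helper are hypothetical, not the authors' actual experimental code.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

REVIEW_PROMPT = "Review the following Java commit and assess its quality:\n{diff}"
N_RUNS = 5  # the exact same prompt is repeated five times

def collect_reviews(diff: str, model: str = "gpt-4o-mini") -> list[str]:
    """Send the identical prompt N_RUNS times with temperature 0.

    Each call is a fresh, single-turn request, so no conversation
    context carries over between repetitions (i.e., context is cleared).
    """
    responses = []
    for _ in range(N_RUNS):
        completion = client.chat.completions.create(
            model=model,
            temperature=0,  # greedy decoding requested; determinism is not guaranteed
            messages=[{"role": "user", "content": REVIEW_PROMPT.format(diff=diff)}],
        )
        responses.append(completion.choices[0].message.content)
    return responses

# Example consistency check: with perfect test-retest reliability,
# all five responses would be character-identical.
# reviews = collect_reviews(some_java_diff)
# is_consistent = len(set(reviews)) == 1
```

Note that temperature 0 only requests (near-)greedy decoding; as the results above indicate, it does not guarantee identical outputs across repeated calls.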