Large language models (LLMs) have shown strong capabilities in code review automation, such as review comment generation. However, they suffer from hallucinations, where the generated review comments are ungrounded in the actual code, which poses a significant challenge to the adoption of LLMs in code review workflows. To address this, we explore effective and scalable methods for reference-free hallucination detection in LLM-generated code review comments. In this work, we design HalluJudge, which assesses the grounding of generated review comments based on context alignment. HalluJudge includes four key strategies, ranging from direct assessment to structured multi-branch reasoning (e.g., Tree-of-Thoughts). We conduct a comprehensive evaluation of these assessment strategies across Atlassian's enterprise-scale software projects to examine the effectiveness and cost-efficiency of HalluJudge. Furthermore, we analyze the alignment between HalluJudge's judgments and developer preferences for actual LLM-generated code review comments in real-world production. Our results show that hallucination assessment in HalluJudge is cost-effective, with an F1 score of 0.85 and an average cost of $0.009. On average, 67% of HalluJudge assessments are aligned with developer preferences for the actual LLM-generated review comments in online production. Our results suggest that HalluJudge can serve as a practical safeguard to reduce developers' exposure to hallucinated comments, fostering trust in AI-assisted code reviews.