Large Language Models (LLMs) have shown strong capabilities in code review automation, such as review comment generation, yet they suffer from hallucinations: generated review comments that are ungrounded in the actual code. This poses a significant challenge to the adoption of LLMs in code review workflows. To address this, we explore effective and scalable methods for reference-free hallucination detection in LLM-generated code review comments. In this work, we design HalluJudge, which assesses the grounding of generated review comments based on their alignment with the code context. HalluJudge includes four key strategies, ranging from direct assessment to structured multi-branch reasoning (e.g., Tree-of-Thoughts). We conduct a comprehensive evaluation of these assessment strategies across Atlassian's enterprise-scale software projects to examine the effectiveness and cost-efficiency of HalluJudge. Furthermore, we analyze the alignment between HalluJudge's judgments and developer preferences for the actual LLM-generated code review comments in real-world production. Our results show that hallucination assessment with HalluJudge is cost-effective, achieving an F1 score of 0.85 at an average cost of $0.009. On average, 67% of HalluJudge's assessments align with developer preferences for the actual LLM-generated review comments in online production. These results suggest that HalluJudge can serve as a practical safeguard that reduces developers' exposure to hallucinated comments, fostering trust in AI-assisted code review.