Recent advancements in large language models (LLMs) have significantly enhanced their ability to understand both natural language and code, driving their use in tasks such as natural-language-to-code (NL2Code) generation and code summarization. However, LLMs are prone to hallucination: outputs that stray from the intended meaning. Detecting hallucinations in code summarization is especially difficult because of the complex interplay between programming and natural languages. We introduce a first-of-its-kind dataset of $\sim$10K samples, curated specifically for hallucination detection in code summarization. We further propose a novel Entity Tracing Framework (ETF) that (a) uses static program analysis to identify code entities in the program, and (b) uses LLMs to map these entities and their intents onto the generated code summary and verify them. Our experimental analysis demonstrates the effectiveness of the framework, which achieves an F1 score of 0.73. By grounding entities, this approach provides an interpretable method for detecting hallucinations and evaluating summary accuracy.