As developers increasingly rely on LLM-generated code summaries for documentation, testing, and review, it is important to study whether these summaries accurately reflect what the program actually does. LLMs often produce confident descriptions of what the code looks like it should do (intent) while missing the subtle edge cases or logic changes that define what it actually does (behavior). We present a mutation-based evaluation methodology that directly tests whether a summary matches the code's logic. Our approach generates a summary, injects a targeted mutation into the code, and checks whether the LLM updates its summary to reflect the new behavior. We validate the approach through three experiments totalling 624 mutation-summary evaluations across 62 programs. First, on 12 controlled synthetic programs with 324 mutations varying in type (statement, value, decision) and location (beginning, middle, end), we find that summary accuracy decreases sharply with complexity, from 76.5% for single functions to 17.3% for multi-threaded systems, while mutation type and location exhibit weaker effects. Second, 150 mutated samples derived from 50 human-written programs in the Less Basic Python Problems (LBPP) dataset confirm that the same failure patterns persist: models often describe algorithmic intent rather than the actual mutated behavior, yielding a summary accuracy of 49.3%. Third, a comparison between GPT-4 and GPT-5.2 shows a substantial performance leap (from 49.3% to 85.3%) and an improved ability to identify mutations as "bugs", yet both models continue to struggle to distinguish implementation details from standard algorithmic patterns. This work establishes mutation analysis as a systematic approach for assessing whether LLM-generated summaries reflect program behavior rather than superficial textual patterns.
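As a rough illustration of the mutation step, the sketch below injects a single decision mutation (`<` becomes `<=`) into a small Python function and confirms that the observable behavior changes. The function `count_below`, the `FlipFirstLessThan` transformer, and the helpers `mutate` and `load` are hypothetical names chosen for this example and are not taken from the study. The summarization and judging steps are omitted because the abstract does not specify them. The sketch assumes Python 3.9 or later for `ast.unparse`.

```python
import ast

# Hypothetical program under test (not from the study's benchmark).
ORIGINAL = '''
def count_below(values, limit):
    count = 0
    for v in values:
        if v < limit:
            count += 1
    return count
'''


class FlipFirstLessThan(ast.NodeTransformer):
    """Decision mutation: rewrite the first `<` comparison as `<=`."""

    def __init__(self):
        self.done = False

    def visit_Compare(self, node):
        self.generic_visit(node)
        is_single_lt = len(node.ops) == 1 and isinstance(node.ops[0], ast.Lt)
        if not self.done and is_single_lt:
            node.ops = [ast.LtE()]
            self.done = True
        return node


def mutate(source):
    """Return the source text with one decision mutation applied."""
    tree = FlipFirstLessThan().visit(ast.parse(source))
    ast.fix_missing_locations(tree)
    return ast.unparse(tree)


def load(source, name):
    """Execute the source text and return the function called `name`."""
    namespace = {}
    exec(source, namespace)
    return namespace[name]


MUTATED = mutate(ORIGINAL)
original_fn = load(ORIGINAL, "count_below")
mutated_fn = load(MUTATED, "count_below")

# The boundary value 3 separates the two behaviors.
sample = [1, 2, 3]
print(original_fn(sample, 3))  # 2: counts values strictly below the limit
print(mutated_fn(sample, 3))   # 3: the mutant also counts values equal to the limit
```

Under the methodology described above, a summary of the mutated version that still says "counts values below the limit" would describe intent rather than behavior, whereas one that says "at or below the limit" would reflect the mutation.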