Type inference is a crucial task for reusing online code snippets, which are often found on platforms like Stack Overflow and frequently lack essential type information such as fully qualified names (FQNs) and required libraries. Recent studies have leveraged Large Language Models (LLMs) for type inference on code snippets, showing promising results. However, these results may be affected by data leakage, as the benchmark suite (StatType-SO) has been public on GitHub since 2017 (with the full suite released in 2023). It is therefore unclear whether LLMs' strong performance reflects genuine understanding of code semantics or mere retrieval of the ground truth from training data. To comprehensively assess LLMs' type inference capabilities on Java code snippets, we conducted a three-pronged evaluation. First, using Thalia, a program synthesis technique, we created ThaliaType, a new, unseen dataset for type inference evaluation. On these unseen snippets, LLM performance dropped significantly, with decreases of up to 59% in precision and 72% in recall. Second, we developed semantic-preserving transformations that significantly degraded LLMs' type inference performance, revealing weaknesses in their understanding of code semantics. Third, we used delta debugging to identify the minimal syntax elements sufficient for LLM inference. Although type inference primarily involves inferring the FQNs of the types used in a snippet, LLMs correctly inferred FQNs even when those types were absent from the snippets, suggesting a reliance on knowledge acquired during training rather than thorough analysis of the snippets. Our findings indicate that LLMs' strong past performance likely stemmed from data leakage rather than a genuine understanding of the semantics of code snippets, and they highlight the need for carefully designed benchmarks of unseen code snippets to assess the true capabilities of LLMs for type inference.
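To make the task concrete, the following minimal sketch (not from the paper; class, variable, and mapping names are illustrative) shows what type inference over a snippet amounts to: recovering the FQN of each simple type name that appears without an import.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class FqnInferenceSketch {
    public static void main(String[] args) {
        // A Stack Overflow-style snippet typically uses simple type names only, e.g.:
        //   List<String> names = new ArrayList<>();
        //   names.add("Alice");
        // The type inference task is to recover each simple name's fully
        // qualified name (FQN), which here would be:
        Map<String, String> inferredFqns = new LinkedHashMap<>();
        inferredFqns.put("List", "java.util.List");
        inferredFqns.put("ArrayList", "java.util.ArrayList");
        inferredFqns.forEach((simple, fqn) ->
            System.out.println(simple + " -> " + fqn));
    }
}
```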