Unmasking the Genuine Type Inference Capabilities of LLMs for Java Code Snippets

Type inference is crucial for reusing online code snippets. Although snippets are widely shared on platforms like StackOverflow, they often lack essential type information, such as fully qualified names (FQNs). Recent studies have leveraged Large Language Models (LLMs) to perform type inference for such code snippets, showing promising results. However, these results may suffer from data leakage, because StatType-SO, the benchmark used for evaluation, has been publicly available on GitHub since 2017. Consequently, it remains uncertain whether the strong performance of LLMs reflects genuine semantic understanding of code or results from the ground truth being included in the training set. This paper strives to comprehensively evaluate the genuine type inference capabilities of LLMs on Java code snippets and to identify their potential limitations. First, we created ThaliaType, a new, previously unreleased benchmark suite designed for type inference evaluation. Second, using the StarCoder2 LLM as a baseline, we uncovered data leakage from StatType-SO in StarCoder2's open-source training set and observed that other state-of-the-art LLMs exhibit similar performance drops when evaluated on ThaliaType, with precision decreasing by up to 59% and recall by up to 72%. Finally, we designed semantic-preserving code transformations to test the capabilities of LLMs in understanding the execution semantics of snippets. Results showed that LLMs' performance on StatType-SO is far less robust to these transformations than their performance on ThaliaType, suggesting that the performance on StatType-SO may be biased by data leakage and may have limited generalizability. These findings highlight the importance of carefully designed, leakage-free benchmarks for evaluating LLMs on type inference tasks. We recommend that future studies adopt ThaliaType for rigorous and reliable assessments of LLMs' genuine type inference capabilities.
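To make the precision and recall figures above concrete, the following is a minimal sketch of how such metrics could be computed for a single snippet, assuming the inferred and ground-truth types are compared as sets of FQN strings. The class name `FqnMetrics`, the method names, and the example FQN sets are hypothetical and are not taken from the paper; the paper's actual evaluation (for example, how scores are aggregated across snippets) may differ.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical helper: set-based precision and recall for inferred FQNs.
public class FqnMetrics {

    // Counts predicted FQNs that also appear in the ground truth.
    private static int countCorrect(Set<String> predicted, Set<String> expected) {
        Set<String> correct = new HashSet<>(predicted);
        correct.retainAll(expected);
        return correct.size();
    }

    // Fraction of predicted FQNs that are correct (0.0 if nothing was predicted).
    public static double precision(Set<String> predicted, Set<String> expected) {
        if (predicted.isEmpty()) {
            return 0.0;
        }
        return (double) countCorrect(predicted, expected) / predicted.size();
    }

    // Fraction of ground-truth FQNs that were recovered (0.0 if there is no ground truth).
    public static double recall(Set<String> predicted, Set<String> expected) {
        if (expected.isEmpty()) {
            return 0.0;
        }
        return (double) countCorrect(predicted, expected) / expected.size();
    }

    public static void main(String[] args) {
        // Assumed example: a snippet that uses the simple names List, Date, and File.
        Set<String> expected = Set.of("java.util.List", "java.util.Date", "java.io.File");
        // Assumed model output: Date is resolved to the wrong package and File is missed.
        Set<String> predicted = Set.of("java.util.List", "java.sql.Date");

        System.out.println("precision = " + precision(predicted, expected));
        System.out.println("recall    = " + recall(predicted, expected));
    }
}
```

In this assumed example, one of the two predicted FQNs is correct and one of the three expected FQNs is recovered, so precision is 1/2 and recall is 1/3. The example also shows why simple names alone are ambiguous: `Date` exists in both `java.util` and `java.sql`.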
