Post-Incorporating Code Structural Knowledge into LLMs via In-Context Learning for Code Translation

Code translation migrates codebases across programming languages. Recently, large language models (LLMs) have achieved significant advancements in software mining. However, handling the syntactic structure of source code remains a challenge. Classic syntax-aware methods depend on intricate model architectures and loss functions, which makes integrating them into LLM training resource-intensive. This paper employs in-context learning (ICL), which directly integrates task exemplars into the input context, to post-incorporate code structural knowledge into pre-trained LLMs. We revisit exemplar selection in ICL from an information-theoretic perspective and propose that list-wise selection based on information coverage is a more precise and more general objective than traditional methods that combine similarity and diversity. Because information coverage is difficult to quantify directly, we introduce a surrogate measure, Coverage of Abstract Syntax Tree (CAST). Furthermore, we formulate exemplar selection as an NP-hard CAST maximization problem and prove that it is a standard submodular maximization problem. We therefore propose a greedy algorithm for CAST submodular maximization, which theoretically guarantees a (1-1/e)-approximate solution in polynomial time. Our method is the first training-free and model-agnostic approach to post-incorporate code structural knowledge into existing LLMs at test time. Experimental results show that our method significantly improves LLM performance and reveal two meaningful insights: 1) Code structural knowledge can be effectively post-incorporated into pre-trained LLMs during inference, despite being overlooked during training; 2) Scaling up model size or training data does not lead to the emergence of code structural knowledge, underscoring the necessity of explicitly considering code syntactic structure.
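To make the greedy coverage idea concrete, the following is a minimal sketch, not the paper's implementation. The abstract does not define CAST precisely, so this example assumes a simple stand-in: the fraction of the query's parent-to-child AST node-type edges that appear in at least one selected exemplar. The function names `ast_features` and `greedy_select` are hypothetical, and the sketch relies on Python's standard `ast` module, so it only handles Python source. Set coverage of this kind is monotone and submodular, which is the setting in which greedy selection under a size budget carries the (1-1/e) guarantee.

```python
import ast
from typing import FrozenSet, List, Tuple


def ast_features(source: str) -> FrozenSet[str]:
    """Return parent>child node-type edges of a Python AST.

    Assumption: these edges stand in for the AST substructures that a
    CAST-like measure would count; the paper may define them differently.
    """
    tree = ast.parse(source)
    feats = set()
    for parent in ast.walk(tree):
        for child in ast.iter_child_nodes(parent):
            feats.add(f"{type(parent).__name__}>{type(child).__name__}")
    return frozenset(feats)


def greedy_select(query: str, candidates: List[str], k: int) -> Tuple[List[int], float]:
    """Greedily pick up to k exemplars that maximize coverage of the query's features.

    Returns the chosen candidate indices (in selection order) and the
    fraction of query features covered by their union.
    """
    q = ast_features(query)
    # Only features that also occur in the query can add coverage.
    cand_feats = [ast_features(c) & q for c in candidates]
    covered = set()
    chosen: List[int] = []
    for _ in range(min(k, len(candidates))):
        best, best_gain = None, 0
        for i, feats in enumerate(cand_feats):
            if i in chosen:
                continue
            gain = len(feats - covered)  # marginal gain of adding exemplar i
            if gain > best_gain:
                best, best_gain = i, gain
        if best is None:  # no remaining exemplar adds new coverage
            break
        chosen.append(best)
        covered |= cand_feats[best]
    score = len(covered) / len(q) if q else 0.0
    return chosen, score


# Illustrative inputs (hypothetical exemplar pool).
query = "def f(xs):\n    for x in xs:\n        if x > 0:\n            return x\n    return None\n"
pool = [
    "y = 1\n",
    "def g(a):\n    return a\n",
    "for i in range(3):\n    if i > 1:\n        print(i)\n",
]
selected, cast_score = greedy_select(query, pool, k=2)
```

Each round scans the remaining candidates once, so the selection runs in time proportional to `k` times the pool size times the feature-set size. Because selection is list-wise, an exemplar that is individually similar to the query is skipped when the structures it offers are already covered by earlier picks.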
