Code language models have recently made notable progress on a wide range of code comprehension and generation tasks. Yet the field lacks a thorough understanding of the code embeddings produced by multilingual code models. In this paper, we present a comprehensive study of multilingual code embeddings, focusing on their cross-lingual capabilities across programming languages. Through probing experiments, we demonstrate that code embeddings comprise two distinct components: one tied closely to the syntax and nuances of a specific language, and another that is agnostic to these details and primarily captures semantics. Further, we show that isolating and removing the language-specific component yields significant improvements in downstream code retrieval tasks, with an absolute increase of up to +17 in Mean Reciprocal Rank (MRR).
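To make the retrieval setting concrete, the following is a minimal sketch of the two ingredients named in the abstract: removing a language-specific component from embeddings and scoring retrieval with MRR. The abstract does not say how the language-specific component is isolated, so per-language mean subtraction is used here purely as an illustrative stand-in and should not be read as the paper's method. The function names `remove_language_mean`, `cosine`, and `mean_reciprocal_rank` are hypothetical, and the sketch assumes that `candidates[i]` is the single correct match for `queries[i]`.

```python
import math
from collections import defaultdict


def remove_language_mean(embeddings, languages):
    """Subtract each language's mean vector from that language's embeddings.

    Assumption: the mean vector is a crude proxy for the language-specific
    component; the abstract does not specify the actual procedure.
    """
    groups = defaultdict(list)
    for vec, lang in zip(embeddings, languages):
        groups[lang].append(vec)
    # Per-language mean, computed coordinate by coordinate.
    means = {
        lang: [sum(col) / len(vecs) for col in zip(*vecs)]
        for lang, vecs in groups.items()
    }
    return [
        [x - m for x, m in zip(vec, means[lang])]
        for vec, lang in zip(embeddings, languages)
    ]


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(x * x for x in v))
    return sum(x * y for x, y in zip(u, v)) / norm if norm else 0.0


def mean_reciprocal_rank(queries, candidates):
    """MRR where candidates[i] is the correct match for queries[i]."""
    total = 0.0
    for i, query in enumerate(queries):
        sims = [cosine(query, cand) for cand in candidates]
        # Rank of the correct candidate: 1 + number of strictly better ones.
        rank = 1 + sum(1 for s in sims if s > sims[i])
        total += 1.0 / rank
    return total / len(queries)
```

In a cross-lingual retrieval experiment of this kind, one would compute `mean_reciprocal_rank` on the raw embeddings and again on the output of `remove_language_mean`, then compare the two scores. The function returns a value between 0 and 1.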