Pretrained language models that produce code token embeddings are used in code search, code clone detection, and other code-related tasks. Function-level embeddings are useful in the same tasks, yet the current literature offers no out-of-the-box model for them. This paper therefore proposes CodeCSE, a contrastive learning model that learns embeddings for functions and their descriptions in a single shared space. We evaluated CodeCSE on code search. CodeCSE's multilingual zero-shot approach is as efficient as the models finetuned from GraphCodeBERT for specific languages. CodeCSE is open source at https://github.com/emu-se/codecse and the pretrained model is available on the HuggingFace public hub: https://huggingface.co/sjiang1/codecse
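To illustrate what it means to embed functions and their descriptions in one space, the following is a minimal sketch and not CodeCSE's actual implementation. It assumes a symmetric InfoNCE-style contrastive objective with in-batch negatives and cosine similarity, which is a common choice but is not confirmed by the abstract. The helper names (`rank_functions`, `info_nce`), the temperature of 0.05, and the toy three-dimensional vectors are all hypothetical; in practice the vectors would come from the pretrained encoder.

```python
import math


def normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def cosine(a, b):
    """Cosine similarity between two vectors of equal length."""
    a, b = normalize(a), normalize(b)
    return sum(x * y for x, y in zip(a, b))


def rank_functions(query_vec, function_vecs):
    """Zero-shot code search: return function indices, best match first."""
    scores = [cosine(query_vec, f) for f in function_vecs]
    return sorted(range(len(function_vecs)), key=lambda i: scores[i], reverse=True)


def info_nce(code_vecs, text_vecs, temperature=0.05):
    """Symmetric InfoNCE loss (assumed objective, hypothetical temperature).

    Pair i (code_vecs[i], text_vecs[i]) is the positive; every other
    item in the batch serves as a negative.
    """
    n = len(code_vecs)
    sims = [[cosine(c, t) / temperature for t in text_vecs] for c in code_vecs]

    def cross_entropy(rows):
        total = 0.0
        for i, row in enumerate(rows):
            m = max(row)  # subtract the max for numerical stability
            log_z = m + math.log(sum(math.exp(s - m) for s in row))
            total += log_z - row[i]
        return total / n

    text_to_code = [list(col) for col in zip(*sims)]  # transpose
    return 0.5 * (cross_entropy(sims) + cross_entropy(text_to_code))


# Toy embeddings standing in for encoder outputs (hypothetical values).
function_vecs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
description_vecs = [[0.9, 0.1, 0.0], [0.1, 0.9, 0.1], [0.0, 0.2, 0.8]]

best = rank_functions(description_vecs[1], function_vecs)[0]
aligned_loss = info_nce(function_vecs, description_vecs)
shuffled_loss = info_nce(function_vecs, description_vecs[::-1])
print(best, aligned_loss < shuffled_loss)
```

In this sketch the loss is lower when each function sits next to its own description than when the pairs are shuffled. That property is what lets a nearest-neighbor lookup over the shared space act as a code search without per-language finetuning.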