For software engineering (SE) tasks, large language models (LLMs) offer zero-shot learning, requiring no training or fine-tuning, unlike pre-trained models (PTMs). However, LLMs are primarily designed for natural-language output and cannot directly produce intermediate embeddings from source code. They also face challenges: their restricted context length may prevent them from handling larger inputs, limiting their applicability to many SE tasks, and hallucinations may occur when they are applied to complex downstream tasks. Motivated by these facts, we propose zsLLMCode, a novel approach that generates functional code embeddings using LLMs. Our approach uses LLMs to convert source code into concise summaries through zero-shot learning; these summaries are then transformed into functional code embeddings by specialized embedding models. This unsupervised approach eliminates the need for training and addresses the hallucination issue encountered with LLMs. To the best of our knowledge, this is the first approach to combine LLMs and embedding models for generating code embeddings. We conducted experiments to evaluate the performance of our approach; the results demonstrate its effectiveness and superiority over state-of-the-art unsupervised methods.
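The two-stage pipeline described above (source code → zero-shot LLM summary → embedding model → functional code embedding) can be sketched as follows. This is a minimal illustration, not the authors' implementation: both components are stubs here so the example is self-contained; in practice, `stub_llm_summarize` would be a real zero-shot LLM call and `stub_embed` a real sentence-embedding model (both names are hypothetical).

```python
import hashlib
import math
from typing import Callable, List


def stub_llm_summarize(source_code: str) -> str:
    """Placeholder for a zero-shot LLM call, e.g. prompting:
    'Summarize the functionality of the following code in one sentence.'
    Here it just echoes the code's first line for illustration."""
    first_line = source_code.strip().splitlines()[0]
    return f"Code whose entry point is: {first_line}"


def stub_embed(text: str, dim: int = 8) -> List[float]:
    """Placeholder for a specialized text-embedding model. This toy
    version derives a deterministic, L2-normalized vector from a hash
    of the summary text."""
    digest = hashlib.sha256(text.encode("utf-8")).digest()
    vec = [float(b) for b in digest[:dim]]
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec]


def code_embedding(
    source_code: str,
    summarize: Callable[[str], str] = stub_llm_summarize,
    embed: Callable[[str], List[float]] = stub_embed,
) -> List[float]:
    """Sketch of the pipeline: code -> LLM summary -> code embedding.
    No training is involved; both stages are inference-only."""
    summary = summarize(source_code)
    return embed(summary)


vec = code_embedding("def add(a, b):\n    return a + b")
print(len(vec))
```

Because the summary is plain natural language, any off-the-shelf embedding model can serve as the second stage, which is what makes the approach unsupervised and training-free.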