This paper presents an improved loss function for neural source code summarization. Code summarization is the task of writing natural language descriptions of source code. Neural code summarization refers to automated techniques for generating these descriptions using neural networks. Almost all current approaches use neural networks either as standalone models or as part of pretrained large language models, e.g., GPT, Codex, LLaMA. Yet almost all also use a categorical cross-entropy (CCE) loss function for network optimization. Two problems with CCE are that 1) it computes loss over each word prediction one at a time, rather than evaluating a whole sentence, and 2) it requires a perfect prediction, leaving no room for partial credit for synonyms. We propose and evaluate a loss function to alleviate these problems. In essence, we propose to use a semantic similarity metric to calculate loss over the whole output sentence prediction per training batch, rather than just loss for each word. We also propose to combine our loss with traditional CCE for each word, which streamlines the training process compared to baselines. We evaluate our approach against several baselines and report an improvement in the vast majority of conditions.
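To make the idea of combining a sentence-level semantic similarity term with per-word CCE concrete, the following is a minimal sketch and not the implementation evaluated in this paper. It rests on several assumptions: the similarity metric is stood in for by cosine similarity between mean-pooled word embeddings, the two terms are combined additively with a hypothetical weight `alpha`, and the predicted sentence is taken by `argmax`. Because `argmax` is not differentiable, the sketch only illustrates how the loss value is computed for one sentence; a per-batch value could be obtained by averaging over the sentences in the batch. All function names are hypothetical.

```python
import numpy as np


def token_cce(probs, target_ids):
    """Mean categorical cross-entropy over the tokens of one sentence.

    probs: (T, V) array of predicted distributions, one row per position.
    target_ids: (T,) array of reference token ids.
    """
    eps = 1e-12  # guards against log(0)
    picked = probs[np.arange(len(target_ids)), target_ids]
    return float(-np.mean(np.log(picked + eps)))


def sentence_similarity(pred_ids, target_ids, embeddings):
    """Cosine similarity between mean-pooled embeddings of two sentences.

    This is an assumed stand-in for a semantic similarity metric.
    """
    p = embeddings[pred_ids].mean(axis=0)
    t = embeddings[target_ids].mean(axis=0)
    denom = np.linalg.norm(p) * np.linalg.norm(t)
    return float(p @ t / denom) if denom > 0 else 0.0


def combined_loss(probs, target_ids, embeddings, alpha=1.0):
    """Per-word CCE plus a whole-sentence dissimilarity penalty.

    The additive form and the weight alpha are assumptions of this sketch.
    """
    pred_ids = probs.argmax(axis=1)  # greedy predicted sentence
    cce = token_cce(probs, target_ids)
    sim = sentence_similarity(pred_ids, target_ids, embeddings)
    return cce + alpha * (1.0 - sim)


# Toy vocabulary: 0="returns", 1="gets", 2="list", 3="file".
# "returns" and "gets" are given nearby embeddings to mimic synonyms.
embeddings = np.array([
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
])
target = np.array([0, 2])  # reference: "returns list"

# Both predictions give the reference word "returns" probability 0.2,
# so their CCE is identical; they differ only in which wrong word wins.
synonym_pred = np.array([[0.2, 0.6, 0.1, 0.1], [0.1, 0.1, 0.7, 0.1]])
unrelated_pred = np.array([[0.2, 0.1, 0.1, 0.6], [0.1, 0.1, 0.7, 0.1]])

loss_synonym = combined_loss(synonym_pred, target, embeddings)
loss_unrelated = combined_loss(unrelated_pred, target, embeddings)
```

In this toy case CCE alone cannot distinguish the two predictions, while the sentence-level term assigns a lower loss to the prediction that substitutes a near-synonym, which is the kind of partial credit the abstract describes.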