Incorporating Prompt Tuning for Commit Classification with Prior Knowledge

Commit classification (CC) is an important task in software maintenance: it helps developers classify code changes into types according to their nature and purpose, so that they can better understand how development is progressing and identify areas that need improvement. However, existing methods are all discriminative models, usually with complex architectures that require additional output layers to produce class-label probabilities. Moreover, they require a large amount of labeled data for fine-tuning, and they struggle to learn effective classification boundaries when labeled data is limited. To solve these problems, we propose a generative framework, Incorporating Prompt-tuning for Commit classification with prior Knowledge (IPCK, https://github.com/AppleMax1992/IPCK), which simplifies the model structure and learns features across different tasks, and which still reaches state-of-the-art (SOTA) performance with only limited samples. First, we build the generative framework on T5. Its encoder-decoder design unifies different CC tasks into a text-to-text problem, which simplifies the model by removing the need for an extra output layer. Second, instead of fine-tuning, we design a prompt-tuning solution that can be adopted in few-shot scenarios with only limited samples. Third, we incorporate prior knowledge from an external knowledge graph to map word probabilities to the final labels in the verbalizer step, which improves performance in few-shot scenarios. Extensive experiments on two openly available datasets show that our framework solves the CC problem simply but effectively in few-shot and zero-shot scenarios, while improving the adaptability of the model without requiring a large number of training samples for fine-tuning.
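The step that maps word probabilities to final labels can be illustrated with a minimal sketch. This is not the IPCK implementation: the class names, the label-word sets (which in the framework would come from an external knowledge graph), the prompt wording, and the averaging rule are all assumptions chosen for illustration. The only detail taken from T5 itself is the `<extra_id_0>` sentinel token used as the slot to fill.

```python
from typing import Dict, List

# Hypothetical label-word sets. In the described framework these would be
# expanded from an external knowledge graph; the classes and words below
# are illustrative assumptions only.
LABEL_WORDS: Dict[str, List[str]] = {
    "corrective": ["fix", "bug", "error"],
    "adaptive": ["add", "feature", "support"],
    "perfective": ["refactor", "cleanup", "optimize"],
}


def build_prompt(commit_message: str) -> str:
    """Wrap a commit message in a cloze-style template (assumed wording).

    <extra_id_0> is T5's first sentinel token, marking the span to generate.
    """
    return f"Commit: {commit_message} This change is about <extra_id_0>."


def verbalize(word_probs: Dict[str, float]) -> Dict[str, float]:
    """Average each label's word probabilities, then normalize over labels."""
    scores = {
        label: sum(word_probs.get(w, 0.0) for w in words) / len(words)
        for label, words in LABEL_WORDS.items()
    }
    total = sum(scores.values())
    if total == 0.0:
        # No label word received any probability mass: fall back to uniform.
        return {label: 1.0 / len(scores) for label in scores}
    return {label: s / total for label, s in scores.items()}


def classify(word_probs: Dict[str, float]) -> str:
    """Return the label with the highest aggregated score."""
    scores = verbalize(word_probs)
    return max(scores, key=scores.get)
```

Here `word_probs` stands in for the probabilities a model assigns to candidate words at the masked position. Because each class is scored through several related words rather than one fixed label token, a class can still be recognized when the model prefers a synonym, which is the intuition behind using prior knowledge in few-shot settings.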
