Patch-CLIP: A Patch-Text Pre-Trained Model

In recent years, patch representation learning has emerged as an important research direction for exploiting the capabilities of machine learning in software generation. These representations have driven significant performance improvements across a variety of tasks involving code changes. Despite this progress, existing models share a common limitation, their specialization: they predominantly excel either in predictive tasks, such as security patch classification, or in generative tasks, such as patch description generation. This dichotomy is further exacerbated by a prevalent dependency on potentially noisy data sources. Specifically, many models use patches integrated with Abstract Syntax Trees (ASTs) that may contain parsing inaccuracies and thus act as a suboptimal source of supervision. In response to these challenges, we introduce PATCH-CLIP, a novel pre-training framework for patches and natural language text. PATCH-CLIP deploys a triple-loss training strategy: 1) patch-description contrastive learning, which makes it possible to separate patches and descriptions in the embedding space; 2) patch-description matching, which ensures that each patch is associated with its description in the embedding space; and 3) patch-description generation, which ensures that the patch embedding is effective for generation. These losses are optimized jointly to achieve good performance in both predictive and generative tasks involving patches. Empirical evaluations focusing on patch description generation demonstrate that PATCH-CLIP sets a new state of the art, consistently outperforming prior approaches on metrics such as BLEU, ROUGE-L, METEOR, and Recall.
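To make the triple-loss strategy concrete, the following is a minimal sketch of how three such objectives could be combined. It is not the authors' implementation. The abstract does not specify the loss formulations, so everything here is an assumption: a CLIP-style symmetric InfoNCE loss for the contrastive term, binary cross-entropy for matching, token-level cross-entropy for generation, and equal loss weights. All function names and parameters (`contrastive_loss`, `matching_loss`, `generation_loss`, `triple_loss`, `temperature`, `weights`) are hypothetical.

```python
import math

def log_softmax(row):
    """Numerically stable log-softmax over a list of floats."""
    m = max(row)
    log_z = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - log_z for x in row]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def normalize(v):
    """Scale a vector to unit L2 norm."""
    n = math.sqrt(dot(v, v))
    return [x / n for x in v]

def contrastive_loss(patch_embs, text_embs, temperature=0.07):
    """Assumed symmetric InfoNCE: the i-th patch and i-th description form the positive pair."""
    p = [normalize(v) for v in patch_embs]
    t = [normalize(v) for v in text_embs]
    n = len(p)
    sim = [[dot(p[i], t[j]) / temperature for j in range(n)] for i in range(n)]
    sim_t = [list(col) for col in zip(*sim)]  # description-to-patch direction
    p2t = -sum(log_softmax(sim[i])[i] for i in range(n)) / n
    t2p = -sum(log_softmax(sim_t[i])[i] for i in range(n)) / n
    return 0.5 * (p2t + t2p)

def matching_loss(match_logits, labels):
    """Assumed binary cross-entropy on pair logits; label 1 means the pair matches."""
    total = 0.0
    for z, y in zip(match_logits, labels):
        s = -z if y == 1 else z  # loss is softplus(-z) for y=1, softplus(z) for y=0
        total += max(s, 0.0) + math.log1p(math.exp(-abs(s)))
    return total / len(labels)

def generation_loss(token_logits, target_ids):
    """Assumed mean cross-entropy of the description tokens given the patch."""
    losses = [-log_softmax(logits)[tid] for logits, tid in zip(token_logits, target_ids)]
    return sum(losses) / len(losses)

def triple_loss(patch_embs, text_embs, match_logits, match_labels,
                token_logits, target_ids, weights=(1.0, 1.0, 1.0)):
    """Weighted sum of the three terms; equal weights are an assumption."""
    return (weights[0] * contrastive_loss(patch_embs, text_embs)
            + weights[1] * matching_loss(match_logits, match_labels)
            + weights[2] * generation_loss(token_logits, target_ids))
```

In this sketch, a batch in which each patch embedding is closest to its own description yields a small contrastive term, while uninformative matching or generation logits contribute about log 2 per binary decision. Summing the terms lets a single backward pass update a shared encoder for both predictive and generative objectives, which is the joint learning the abstract describes.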
