Developers introduce code clones to improve programming productivity. Many existing studies have achieved impressive performance in monolingual code clone detection. However, during software development, more and more developers write semantically equivalent programs in different languages to support different platforms and to migrate projects from one language to another. Because collecting cross-language parallel data is expensive and time-consuming, especially for low-resource languages, designing an effective cross-language model that does not rely on any parallel data is a significant problem. In this paper, we propose a novel method named ZC3 for Zero-shot Cross-language Code Clone detection. ZC3 introduces contrastive snippet prediction to form an isomorphic representation space among different programming languages. On this basis, ZC3 exploits domain-aware learning and cycle consistency learning to further constrain the model to generate representations that are aligned across languages while remaining discriminative for different types of clones. To evaluate our approach, we conduct extensive experiments on four representative cross-language clone detection datasets. Experimental results show that ZC3 outperforms the state-of-the-art baselines by 67.12%, 51.39%, 14.85%, and 53.01% in MAP score on the four datasets, respectively. We further investigate the representational distribution of different languages and discuss the effectiveness of our method.
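For readers unfamiliar with the reported metric, the following is a minimal sketch of how a MAP (mean average precision) score is commonly computed for retrieval-style clone detection. It is an illustration under stated assumptions, not the evaluation code used in this work: the function names `average_precision` and `mean_average_precision` are hypothetical, and each query is assumed to come with a fully ranked candidate list labeled as clone or non-clone.

```python
from typing import Sequence


def average_precision(ranked_is_clone: Sequence[bool]) -> float:
    """Average precision for one query.

    `ranked_is_clone[i]` says whether the candidate at rank i + 1 is a true
    clone of the query. Assumption: the list holds every candidate, so all
    true clones appear somewhere in it.
    """
    hits = 0
    precision_sum = 0.0
    for rank, is_clone in enumerate(ranked_is_clone, start=1):
        if is_clone:
            hits += 1
            # Precision at this rank, counted only where a true clone sits.
            precision_sum += hits / rank
    return precision_sum / hits if hits else 0.0


def mean_average_precision(queries: Sequence[Sequence[bool]]) -> float:
    """Mean of the per-query average precision values."""
    if not queries:
        return 0.0
    return sum(average_precision(q) for q in queries) / len(queries)


if __name__ == "__main__":
    # Two toy queries with hand-made relevance labels (illustrative only).
    toy_queries = [[True, False, True], [False, True]]
    print(mean_average_precision(toy_queries))
```

Under this definition, a model scores higher when true clones written in another language are ranked ahead of non-clones for each query program.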