Accurately determining code similarity is crucial to many software development tasks. For example, identifying code duplicates can be essential for software maintenance. This research introduces a novel ensemble learning approach to code similarity assessment that combines multiple unsupervised similarity measures. The key idea is that a diverse set of similarity measures can complement one another and mitigate individual weaknesses, leading to improved performance. Preliminary results show that, while the Transformer-based CodeBERT and its variant GraphCodeBERT are undoubtedly the best option when training data is abundant, on specific small datasets (up to 500 samples) our ensemble achieves similar results while preserving the interpretability of the resulting solution and incurring a much lower training-related carbon footprint. The source code of this novel approach can be downloaded from https://github.com/jorge-martinez-gil/ensemble-codesim.
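To make the central idea more concrete, the following is a minimal sketch of how several unsupervised similarity measures might be combined into a single score. It is not the implementation from the linked repository: the three measures chosen here (token Jaccard, token sequence ratio, character n-gram cosine), the function names, and the weighted-average combination are all illustrative assumptions. In an ensemble learning setting, the weights would presumably be fitted on a small labelled dataset rather than set by hand.

```python
import re
from collections import Counter
from difflib import SequenceMatcher
from math import sqrt

# Hypothetical tokenizer: identifiers, integers, or any single non-space symbol.
TOKEN_RE = re.compile(r"[A-Za-z_]\w*|\d+|\S")


def tokenize(code):
    return TOKEN_RE.findall(code)


def jaccard_tokens(a, b):
    """Overlap of the two token sets (ignores order and frequency)."""
    sa, sb = set(tokenize(a)), set(tokenize(b))
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)


def sequence_ratio(a, b):
    """Order-sensitive similarity of the two token sequences."""
    return SequenceMatcher(None, tokenize(a), tokenize(b)).ratio()


def cosine_char_ngrams(a, b, n=3):
    """Cosine similarity of character n-gram counts, whitespace removed."""
    def grams(text):
        s = "".join(text.split())
        return Counter(s[i:i + n] for i in range(max(len(s) - n + 1, 0)))

    ca, cb = grams(a), grams(b)
    if not ca or not cb:
        return 1.0 if ca == cb else 0.0
    dot = sum(ca[g] * cb[g] for g in ca.keys() & cb.keys())
    norm = sqrt(sum(v * v for v in ca.values()) * sum(v * v for v in cb.values()))
    return dot / norm


# Assumed set of base measures; any measure returning a value in [0, 1] fits.
MEASURES = (jaccard_tokens, sequence_ratio, cosine_char_ngrams)


def measure_scores(a, b):
    """Per-measure scores, kept separate so the result stays inspectable."""
    return [measure(a, b) for measure in MEASURES]


def ensemble_similarity(a, b, weights=None):
    """Weighted average of the base measures (uniform weights by default)."""
    scores = measure_scores(a, b)
    if weights is None:
        weights = [1.0] * len(scores)
    return sum(w * s for w, s in zip(weights, scores)) / sum(weights)
```

Calling `measure_scores(a, b)` alongside `ensemble_similarity(a, b)` shows how much each base measure contributes to the final score, which is one simple way an ensemble of this kind can remain interpretable.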