TransformCode: A Contrastive Learning Framework for Code Embedding via Subtree Transformation

Artificial intelligence (AI) has revolutionized software engineering (SE) by enhancing software development efficiency. The advent of pre-trained models (PTMs) leveraging transfer learning has significantly advanced AI for SE. However, existing PTMs that operate on individual code tokens suffer from two limitations: they are costly to train and fine-tune, and they rely heavily on labeled data for fine-tuning on task-specific datasets. In this paper, we present TransformCode, a novel framework that learns code embeddings in a contrastive learning manner. Our framework is encoder-agnostic and language-agnostic, which means that it can leverage any encoder model and handle any programming language. We also propose a novel data-augmentation technique called abstract syntax tree (AST) transformation, which applies syntactic and semantic transformations to the original code snippets to generate more diverse and robust samples for contrastive learning. Our framework has several advantages over existing methods: (1) it is flexible and adaptable, because it can easily be extended to other downstream tasks that require code representation (such as code-clone detection and classification); (2) it is efficient and scalable, because it does not require a large model or a large amount of training data, and it can support any programming language; (3) it is not limited to unsupervised learning, but can also be applied to some supervised learning tasks by incorporating task-specific labels or objectives; and (4) the number of encoder parameters can be adjusted to the available computing resources. We evaluate our framework on several code-related tasks and show that it is effective and outperforms state-of-the-art methods such as SourcererCC, Code2vec, and InferCode.
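To make the idea of AST-based augmentation for contrastive learning more concrete, the following is a minimal sketch and not the authors' implementation. It assumes Python source as input and Python 3.9 or later (for `ast.unparse`). It uses a single illustrative transformation, consistent renaming of arguments and assigned variables, which is not necessarily one of the transformations TransformCode applies. It pairs this with InfoNCE, a commonly used contrastive loss; the abstract does not state which loss the framework uses. The names `rename_variables`, `cosine`, and `info_nce` are hypothetical.

```python
import ast
import math


def rename_variables(source):
    """Return a semantically equivalent variant of `source` in which function
    arguments and assigned variables are renamed to var_0, var_1, ...

    Illustrative only: keyword arguments at call sites, attributes, and
    global/nonlocal declarations are not handled.
    """
    tree = ast.parse(source)

    # Pass 1: collect every name that is bound as an argument or by assignment.
    mapping = {}
    for node in ast.walk(tree):
        if isinstance(node, ast.arg):
            mapping.setdefault(node.arg, f"var_{len(mapping)}")
        elif isinstance(node, ast.Name) and isinstance(node.ctx, ast.Store):
            mapping.setdefault(node.id, f"var_{len(mapping)}")

    # Pass 2: rewrite all occurrences of the collected names.
    for node in ast.walk(tree):
        if isinstance(node, ast.arg):
            node.arg = mapping[node.arg]
        elif isinstance(node, ast.Name) and node.id in mapping:
            node.id = mapping[node.id]

    return ast.unparse(tree)


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce(anchor, positive, negatives, temperature=0.1):
    """InfoNCE loss for one anchor embedding: low when the anchor is close to
    its positive (the embedding of the transformed snippet) and far from the
    negatives (embeddings of other snippets)."""
    pos = math.exp(cosine(anchor, positive) / temperature)
    neg = sum(math.exp(cosine(anchor, n) / temperature) for n in negatives)
    return -math.log(pos / (pos + neg))


# Build a positive pair: the original snippet and its transformed variant.
original = "def add(a, b):\n    total = a + b\n    return total\n"
augmented = rename_variables(original)
print(augmented)
```

In a full pipeline, an encoder of the user's choice would map `original` and `augmented` to embedding vectors, and a loss such as `info_nce` would pull that pair together while pushing embeddings of unrelated snippets apart. The encoder is deliberately left out here, since the framework is described as encoder-agnostic.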
