Deep learning (DL) models for analyzing source code have shown immense promise over the past few years. More recently, self-supervised pre-training has gained traction for learning generic code representations that are valuable for many downstream software engineering (SE) tasks, such as clone detection and bug detection. While previous work has successfully learned from different code abstractions (e.g., token, AST, graph), we argue that general-purpose representation learning must also factor in how developers code day-to-day. On the one hand, human developers tend to write repetitive programs that reference existing code snippets from the current codebase or from online resources (e.g., Stack Overflow) rather than implementing functions from scratch; such behavior results in a vast number of code clones. On the other hand, a clone that deviates from the original by mistake might trigger malicious program behaviors. Thus, as a proxy for incorporating developers' coding behavior into the pre-training scheme, we propose to include code clones and their deviants. In particular, we propose CONCORD, a self-supervised contrastive learning strategy that places benign clones closer together in the representation space while moving deviants further apart. We show that CONCORD's clone-aware contrastive learning drastically reduces the need for expensive pre-training resources while improving the performance of downstream SE tasks. We also empirically demonstrate that CONCORD can improve existing pre-trained models so that they learn better representations, which in turn become more efficient at both identifying semantically equivalent programs and differentiating buggy from non-buggy code.
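To make the idea of pulling benign clones together and pushing deviants apart concrete, the following is a minimal sketch of a generic InfoNCE-style contrastive loss over embedding vectors. It is an illustration under stated assumptions, not CONCORD's actual objective: the function names, the use of cosine similarity, and the `temperature` value are all hypothetical choices that the abstract does not specify.

```python
import math
from typing import Sequence

Vector = Sequence[float]


def cosine_similarity(u: Vector, v: Vector) -> float:
    """Cosine similarity between two embedding vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


def clone_aware_contrastive_loss(
    anchor: Vector,
    benign_clone: Vector,
    deviants: Sequence[Vector],
    temperature: float = 0.1,  # assumed value, chosen only for illustration
) -> float:
    """Generic InfoNCE-style loss (an assumption, not the paper's formula).

    The benign clone is the positive example and each deviant is a negative.
    The loss is small when the anchor is more similar to its benign clone
    than to any deviant, and large otherwise.
    """
    if temperature <= 0.0:
        raise ValueError("temperature must be positive")
    positive = cosine_similarity(anchor, benign_clone) / temperature
    negatives = [cosine_similarity(anchor, d) / temperature for d in deviants]
    logits = [positive] + negatives
    # Log-sum-exp with the maximum subtracted for numerical stability.
    peak = max(logits)
    log_denominator = peak + math.log(sum(math.exp(x - peak) for x in logits))
    # Equivalent to -log(softmax probability assigned to the positive).
    return log_denominator - positive


# Toy 2-D embeddings: the benign clone points almost the same way as the
# anchor, while the deviant is orthogonal to it.
anchor = [1.0, 0.0]
well_separated = clone_aware_contrastive_loss(anchor, [0.9, 0.1], [[0.0, 1.0]])
# Swapping the roles simulates a model that confuses clone and deviant.
confused = clone_aware_contrastive_loss(anchor, [0.0, 1.0], [[0.9, 0.1]])
print(well_separated < confused)
```

In this toy setting the loss is lower when the benign clone sits near the anchor and the deviant sits far from it, which is the behavior a clone-aware objective would reward during pre-training. In practice the vectors would come from a code encoder and the loss would be computed over batches with an automatic-differentiation framework.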