The growing adoption of deep learning models in many critical source code tasks motivates the development of data augmentation (DA) techniques to enhance training data and improve various capabilities (e.g., robustness and generalizability) of these models. Although a series of DA methods have been proposed and tailored for source code models, a comprehensive survey and examination of their effectiveness and implications is still lacking. This paper fills this gap with an integrative survey of data augmentation for source code, in which we systematically compile and summarize existing literature to provide an overview of the field. We start with an introduction to data augmentation in source code and then discuss the major representative approaches. Next, we highlight the general strategies and techniques used to optimize DA quality. Subsequently, we underscore techniques useful in real-world source code scenarios and downstream tasks. Finally, we outline the prevailing challenges and potential opportunities for future research. In essence, we aim to demystify the existing literature on source code DA for deep learning and to foster further exploration in this area. Complementing this, we present a continually updated GitHub repository that hosts a list of up-to-date papers on DA for source code modeling, accessible at \url{https://github.com/terryyz/DataAug4Code}.