Automated program repair (APR) aims to fix software bugs automatically and plays a crucial role in software development and maintenance. With recent advances in deep learning (DL), a growing number of APR techniques use neural networks to learn bug-fixing patterns from massive open-source code repositories. Such learning-based techniques usually treat APR as a neural machine translation (NMT) task, in which buggy code snippets (the source language) are automatically translated into fixed code snippets (the target language). Benefiting from the ability of DL to learn hidden relationships from previous bug-fixing datasets, learning-based APR techniques have achieved remarkable performance. In this paper, we provide a systematic survey of the current state-of-the-art research in the learning-based APR community. We illustrate the general workflow of learning-based APR techniques and detail its crucial components, including the fault localization, patch generation, patch ranking, patch validation, and patch correctness phases. We then discuss the widely adopted datasets and evaluation metrics and outline existing empirical studies. We discuss several critical aspects of learning-based APR techniques, such as repair domains, industrial deployment, and the open science issue. We highlight several practical guidelines on applying DL techniques in future APR studies, such as exploring explainable patch generation and utilizing code features. Overall, our paper can help researchers gain a comprehensive understanding of the achievements of existing learning-based APR techniques and promote the practical application of these techniques. Our artifacts are publicly available at \url{https://github.com/QuanjunZhang/AwesomeLearningAPR}.
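The workflow phases named above can be illustrated with a toy pipeline. The sketch below is only an assumption-laden illustration, not the method of any surveyed technique: the buggy function `max_of`, the test suite, and every helper name are hypothetical, fault localization is reduced to a keyword heuristic, and a fixed table of operator rewrites with made-up scores stands in for a trained NMT model.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

# Hypothetical buggy program: max_of returns the smaller argument.
BUGGY_SOURCE = (
    "def max_of(a, b):\n"
    "    if a < b:\n"
    "        return a\n"
    "    return b\n"
)

# Hypothetical test suite: (arguments, expected result).
TESTS = [((1, 2), 2), ((5, 3), 5), ((4, 4), 4)]

# Stand-in for a learned model: operator rewrites with made-up scores.
REWRITES = {"<": [(">", 0.5), ("<=", 0.3), (">=", 0.2)]}


@dataclass
class Patch:
    line_no: int   # 0-based index of the line to replace
    new_line: str  # candidate replacement line
    score: float   # ranking score (a model probability in a real system)


def localize(lines: List[str]) -> List[int]:
    """Fault localization (toy): flag lines containing a rewritable operator."""
    return [i for i, line in enumerate(lines)
            if any(f" {op} " in line for op in REWRITES)]


def generate(lines: List[str], suspicious: List[int]) -> List[Patch]:
    """Patch generation (toy): apply each rewrite rule to each flagged line."""
    patches = []
    for i in suspicious:
        for op, alternatives in REWRITES.items():
            if f" {op} " not in lines[i]:
                continue
            for alt, score in alternatives:
                new_line = lines[i].replace(f" {op} ", f" {alt} ", 1)
                patches.append(Patch(i, new_line, score))
    return patches


def rank(patches: List[Patch]) -> List[Patch]:
    """Patch ranking: try the highest-scored candidates first."""
    return sorted(patches, key=lambda p: p.score, reverse=True)


def apply_patch(lines: List[str], patch: Patch) -> str:
    patched = list(lines)
    patched[patch.line_no] = patch.new_line
    return "\n".join(patched) + "\n"


def passes(source: str, tests: List[Tuple[tuple, int]]) -> bool:
    """Patch validation: execute the candidate and run the test suite."""
    namespace: dict = {}
    try:
        exec(source, namespace)
        return all(namespace["max_of"](*args) == expected
                   for args, expected in tests)
    except Exception:
        return False


def repair(source: str, tests: List[Tuple[tuple, int]]) -> Optional[str]:
    """Return the first candidate program that passes all tests, if any."""
    lines = source.splitlines()
    for patch in rank(generate(lines, localize(lines))):
        candidate = apply_patch(lines, patch)
        if passes(candidate, tests):
            return candidate
    return None


if __name__ == "__main__":
    print(repair(BUGGY_SOURCE, TESTS))
```

A candidate that passes the available tests is only plausible; the patch correctness phase asks whether it is actually correct rather than overfitted to those tests. In this sketch, calling `passes` on the returned program with additional held-out tests is a crude stand-in for that assessment.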