Just-in-Time Code Duplicates Extraction

Refactoring is a critical task in software maintenance, usually performed to enforce better design and coding practices and to address design defects. The Extract Method refactoring is widely used to merge duplicate code fragments into a single new method. Several studies have attempted to recommend Extract Method refactoring opportunities using different techniques, including program slicing, program dependency graph analysis, change history analysis, structural similarity, and feature extraction. However, irrespective of the technique, most existing approaches interfere with the developer's workflow: they require the developer to stop coding and analyze the suggested opportunities, and they present refactoring suggestions for the entire project without focusing on the current development context. To increase the adoption of the Extract Method refactoring, in this paper we investigate the effectiveness of machine learning and deep learning algorithms for recommending it while preserving the developer's workflow. The proposed approach relies on mining previously applied Extract Method refactorings and extracting their features to train a deep learning classifier that detects similar opportunities in the user's code. We implemented our approach as a plugin for IntelliJ IDEA called AntiCopyPaster. To develop our approach, we trained and evaluated various popular models on a dataset of 18,942 code fragments from 13 open-source Apache projects. The results show that the best model is a Convolutional Neural Network (CNN), which recommends appropriate Extract Method refactorings with an F-measure of 0.82. We also conducted a qualitative study with 72 developers to evaluate the usefulness of the developed plugin. The results show that developers tend to appreciate the idea of the approach and are satisfied with various aspects of the plugin's operation.
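To make the refactoring concrete, the following is a minimal sketch of what Extract Method does to a duplicated fragment. It is written in Python only for brevity; all function names and the sample data are hypothetical and are not taken from the paper, its dataset, or the AntiCopyPaster plugin.

```python
# Illustrative only: every name below is a hypothetical example.

# Before: the same normalization fragment is pasted into two functions.
def shipping_label_before(name: str, city: str) -> str:
    name = name.strip().title()
    city = city.strip().title()
    return f"Ship to: {name}, {city}"


def invoice_header_before(name: str, city: str) -> str:
    name = name.strip().title()
    city = city.strip().title()
    return f"Invoice for: {name}, {city}"


# After: the duplicated fragment is extracted into a single new function,
# and both callers delegate to it.
def format_recipient(name: str, city: str) -> str:
    return f"{name.strip().title()}, {city.strip().title()}"


def shipping_label(name: str, city: str) -> str:
    return f"Ship to: {format_recipient(name, city)}"


def invoice_header(name: str, city: str) -> str:
    return f"Invoice for: {format_recipient(name, city)}"


if __name__ == "__main__":
    # Behavior is unchanged by the refactoring.
    print(shipping_label(" ada lovelace ", "london"))
    print(invoice_header(" ada lovelace ", "london"))
```

The refactored version behaves identically to the original, but the normalization logic now lives in one place, so a later change to it cannot drift between copies.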
