Code cloning, the practice of copying code fragments to save development time, is widespread but often comes at the expense of software maintainability and quality. In this paper, we address the challenge of detecting "essence clones", a complex subtype of Type-3 clones in which two fragments share the same critical logic but differ in their peripheral code. Traditional techniques often miss essence clones because they focus on syntactic similarity. To overcome this limitation, we introduce ECScan, a novel detection tool that uses information theory to assess the semantic importance of each line of code. By weighting each line according to its information content, ECScan emphasizes core logic over differences in peripheral code. Our evaluation on a range of real-world projects shows that ECScan significantly outperforms existing tools in detecting essence clones, achieving an average F1-score of 85%. It also performs robustly across all clone types and scales well. This study advances clone detection by giving developers a practical tool for improving code quality and reducing maintenance burden, and by emphasizing the semantic aspects of code through an information-theoretic approach.
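To make the idea of information-based line weighting concrete, the following is a minimal sketch in Python. It is not ECScan's actual algorithm, which the abstract does not specify. It assumes that a token's importance can be approximated by its self-information, -log2 p(token), estimated from token frequencies in a corpus, that a line's weight is the sum of its tokens' self-information, and that two fragments are compared with a weighted Jaccard overlap of their lines. The names `token_information`, `line_weight`, and `weighted_similarity`, as well as the toy corpus, are hypothetical and chosen only for illustration.

```python
import math
import re
from collections import Counter

# Crude lexer (an assumption): identifiers, integer literals, single punctuation marks.
TOKEN_RE = re.compile(r"[A-Za-z_]\w*|\d+|[^\s\w]")


def tokenize(line):
    return TOKEN_RE.findall(line)


def token_information(corpus_lines):
    """Self-information -log2 p(token), with p estimated from corpus frequencies."""
    counts = Counter(tok for line in corpus_lines for tok in tokenize(line))
    total = sum(counts.values())
    return {tok: -math.log2(c / total) for tok, c in counts.items()}


def line_weight(tokens, info):
    """Weight of a line: summed self-information of its tokens.

    Tokens absent from the corpus get weight 0 in this sketch, so the corpus
    should include the fragments being compared.
    """
    return sum(info.get(tok, 0.0) for tok in tokens)


def weighted_similarity(frag_a, frag_b, info):
    """Weighted Jaccard over lines; lines match only if their token sequences are equal."""
    lines_a = {tuple(tokenize(l)) for l in frag_a if l.strip()}
    lines_b = {tuple(tokenize(l)) for l in frag_b if l.strip()}

    def total(lines):
        return sum(line_weight(tokens, info) for tokens in lines)

    union = total(lines_a | lines_b)
    return total(lines_a & lines_b) / union if union else 0.0


# Toy example: same core computation, different peripheral logging line.
frag_a = ["log.info('start')", "total = price * qty * (1 - discount)", "return total"]
frag_b = ["print('begin')", "total = price * qty * (1 - discount)", "return total"]

# Hypothetical corpus in which the logging lines are very common boilerplate.
corpus = frag_a + frag_b + ["log.info('start')"] * 20 + ["print('begin')"] * 20
info = token_information(corpus)

print(weighted_similarity(frag_a, frag_b, info))
```

In this toy setting the two fragments share two of four distinct lines, so an unweighted Jaccard score would be 0.5. Because the shared lines consist of rare tokens and the differing lines consist of common ones, the weighted score comes out higher, which is the effect the abstract describes as emphasizing core logic over peripheral differences. A real detector would also need to tolerate small edits within lines, which exact line matching does not.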