This paper reviews work published between 2002 and 2022 on Android malware, clone, and similarity detection. It examines the data sources, tools, and features used in existing research and identifies the need for a comprehensive, cross-domain dataset that would facilitate interdisciplinary collaboration and let different research areas exploit their synergies. It also shows that many papers publish neither their dataset nor a description of how it was created, which makes their results difficult to reproduce or compare. The paper therefore argues for a dataset that is accessible, well documented, and suitable for a range of applications, and it provides guidelines to that end along with a schematic method for creating such a dataset.