Numerous tools rely on automatic categorization of Android apps as part of their methodology. Incorrect categorization, however, can lead to inaccurate outcomes, such as a malware detector flagging a benign app as malicious. One example is the SlideIT Free Keyboard app, which has over 500,000 downloads on Google Play. Although it is a "Keyboard" app, it is often categorized alongside "Language" apps because its description focuses heavily on language support. This leads to incorrect analysis outcomes, including the benign app being mislabeled as potential malware. Hence, there is a need to improve the categorization of Android apps to benefit all the tools that rely on it. In this paper, we present a comprehensive evaluation of existing Android app categorization approaches using our new ground-truth dataset. Our evaluation shows that approaches using app descriptions notably outperform those relying solely on data extracted from the APK file, while still leaving room for improvement. We therefore propose two new approaches that outperform existing methods in both the description-based and APK-based categories. Finally, by employing our description-based approach, we demonstrate that adopting a higher-performing categorization method can significantly benefit tools that rely on app categorization, improving their overall performance. This highlights the importance of developing advanced and efficient app categorization methods for software engineering tasks.