Revisiting Android App Categorization

Numerous tools rely on automatic categorization of Android apps as part of their methodology. However, incorrect categorization can lead to inaccurate outcomes, such as a malware detector wrongly flagging a benign app as malicious. One example is the SlideIT Free Keyboard app, which has over 500,000 downloads on Google Play. Although it is a "Keyboard" app, it is often categorized alongside "Language" apps because its description focuses heavily on language support. This leads to incorrect analysis outcomes, including the benign app being mislabeled as potential malware. Hence, there is a need to improve the categorization of Android apps to benefit all the tools relying on it. In this paper, we present a comprehensive evaluation of existing Android app categorization approaches using our new ground-truth dataset. Our evaluation shows that approaches using app descriptions are notably superior to those relying solely on data extracted from the APK file, while description-based approaches still leave room for improvement. We therefore propose two new approaches that outperform existing methods in both the description-based and APK-based settings. Finally, using our description-based approach, we demonstrate that adopting a higher-performing categorization method can significantly benefit tools that rely on app categorization, improving their overall performance. This highlights the importance of developing advanced and efficient app categorization methods for software engineering tasks.
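To make the SlideIT example concrete, the following is a purely illustrative sketch of how a naive description-based categorizer can be misled. It is not the approach evaluated or proposed in the paper. The keyword sets, the `categorize` function, and the sample description are all hypothetical and were invented for this illustration.

```python
import re
from collections import Counter

# Hypothetical keyword profiles for two categories (assumption, not from the paper).
CATEGORY_KEYWORDS = {
    "Keyboard": {"keyboard", "typing", "keys", "swipe", "autocorrect"},
    "Language": {"language", "languages", "translate", "dictionary",
                 "english", "spanish", "french"},
}


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def categorize(description):
    """Score each category by counting keyword occurrences in the description.

    Returns the highest-scoring category and the per-category scores.
    """
    counts = Counter(tokenize(description))
    scores = {
        category: sum(counts[word] for word in keywords)
        for category, keywords in CATEGORY_KEYWORDS.items()
    }
    return max(scores, key=scores.get), scores


# Invented description of a keyboard app that mostly talks about language support.
description = (
    "A keyboard with support for many languages: English, Spanish, French "
    "and more. Download extra language packs and switch language with one tap."
)

label, scores = categorize(description)
print(label, scores)
```

In this toy setup the language-related words outnumber the single mention of "keyboard", so the app is assigned to "Language" even though it is a keyboard app. This is the kind of error the abstract describes.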

