Despite continued research and progress in building secure systems, Android applications remain riddled with vulnerabilities, necessitating effective detection methods. Current strategies based on static and dynamic analysis tools have limitations, such as an overwhelming number of false positives and a limited scope of analysis, that make either difficult to adopt. In recent years, machine-learning-based approaches have been extensively explored for vulnerability detection, but their real-world applicability is constrained by data requirements and feature engineering challenges. Large Language Models (LLMs), with their vast number of parameters, have shown tremendous potential in understanding semantics in both human and programming languages. We investigate the efficacy of LLMs for detecting vulnerabilities in the context of Android security. We focus on building an AI-driven workflow to assist developers in identifying and rectifying vulnerabilities. Our experiments show that LLMs exceed our expectations in finding issues within applications, correctly flagging insecure apps in 91.67% of cases in the Ghera benchmark. We use inferences from our experiments to build a robust and actionable vulnerability detection system and demonstrate its effectiveness. Our experiments also shed light on how various simple configurations can affect the True Positive (TP) and False Positive (FP) rates.