The increasing frequency of attacks on Android applications, coupled with the recent popularity of large language models (LLMs), necessitates a comprehensive understanding of the latter's ability to identify potential vulnerabilities, which is key to mitigating the overall risk. To this end, this work compares the ability of nine state-of-the-art LLMs to detect Android code vulnerabilities listed in the latest Open Worldwide Application Security Project (OWASP) Mobile Top 10. Each LLM was evaluated against an open dataset of over 100 vulnerable code samples, including obfuscated ones, to assess its ability to identify key vulnerabilities. Our analysis reveals the strengths and weaknesses of each LLM and identifies important factors that contribute to their performance. Additionally, we offer insights into context augmentation with retrieval-augmented generation (RAG) for detecting Android code vulnerabilities, which in turn may advance secure application development. Finally, while the reported findings regarding code vulnerability analysis show promise, they also reveal significant discrepancies among the different LLMs.