Android application (app) developers increasingly rely on code obfuscation techniques to hinder reverse engineering and protect intellectual property. However, obfuscation also reduces the effectiveness of static analysis and vulnerability detection tools, which complicates Android security analysis. Existing approaches for detecting obfuscation in Android apps predominantly rely on handcrafted heuristics, engineered features, or task-specific learning pipelines, which may struggle to generalize across evolving obfuscation strategies. This paper presents a large-scale empirical study of whether Large Language Models (LLMs) can detect obfuscation in Android apps through semantic reasoning. The study evaluates whether off-the-shelf LLMs can identify obfuscated code without relying on handcrafted rules, predefined signatures, or dedicated model training. The evaluation is conducted on both a controlled benchmark containing an app obfuscated with multiple techniques and a real-world dataset of Android apps collected from Google Play. The study further examines the impact of prompt design, model selection, and decision thresholds across several open-weight and proprietary LLMs. Finally, the analysis compares LLM-based reasoning with existing obfuscation-detection approaches based on static application security testing (SAST) and discusses the broader implications and limitations of applying LLMs to Android security analysis.