WebShell attacks, in which adversaries implant malicious scripts on web servers, remain a persistent threat. Prior machine-learning and deep-learning detectors typically depend on task-specific supervision and can be brittle under data scarcity, rapid concept drift, and out-of-distribution (OOD) deployment. Large language models (LLMs) have recently shown strong code-understanding capabilities, but their reliability for WebShell detection remains unclear. We address this gap by (i) systematically evaluating seven LLMs (including GPT-4, LLaMA-3.1-70B, and Qwen-2.5 variants) against representative sequence- and graph-based baselines on 26.59K PHP scripts, and (ii) proposing Behavioral Function-Aware Detection (BFAD), a behavior-centric framework that adapts LLM inference to WebShell-specific execution patterns. BFAD anchors analysis on security-sensitive PHP functions via a Critical Function Filter, constructs compact LLM inputs with Context-Aware Code Extraction, and selects in-context demonstrations using Weighted Behavioral Function Profiling, which ranks candidate examples by a behavior-weighted, function-level similarity. Empirically, we observe a consistent precision-recall asymmetry: larger LLMs often achieve high precision but miss attacks (lower recall), while smaller models exhibit the opposite tendency; moreover, off-the-shelf LLM prompting underperforms established detectors. BFAD substantially improves all evaluated LLMs, boosting F1 by 13.82% on average; notably, GPT-4, LLaMA-3.1-70B, and Qwen-2.5-Coder-14B exceed prior state-of-the-art results, while Qwen-2.5-Coder-3B becomes competitive with traditional methods. Overall, our results clarify when LLMs succeed or fail at WebShell detection, provide a practical recipe for improving them, and highlight directions for making LLM-based detection more reliable.
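To make the demonstration-selection idea concrete, the behavior-weighted, function-level similarity behind Weighted Behavioral Function Profiling can be sketched as below. This is a minimal illustration, not the paper's exact formulation: the function list, the numeric weights, and the use of cosine similarity over weighted counts are all assumptions introduced here for clarity.

```python
import re
from math import sqrt

# Hypothetical behavior weights for security-sensitive PHP functions.
# Illustrative values only; BFAD's Critical Function Filter defines its own
# function set and weighting.
BEHAVIOR_WEIGHTS = {
    "eval": 3.0, "system": 3.0, "exec": 3.0, "shell_exec": 3.0,
    "passthru": 2.5, "assert": 2.0, "base64_decode": 1.5,
    "gzinflate": 1.5, "preg_replace": 1.0, "file_put_contents": 1.0,
}

def function_profile(php_source: str) -> dict:
    """Weighted counts of critical-function call sites in a PHP script."""
    profile = {}
    for fn, weight in BEHAVIOR_WEIGHTS.items():
        hits = len(re.findall(r"\b" + re.escape(fn) + r"\s*\(", php_source))
        if hits:
            profile[fn] = weight * hits
    return profile

def profile_similarity(a: dict, b: dict) -> float:
    """Cosine similarity between two weighted function profiles."""
    dot = sum(a[f] * b[f] for f in set(a) & set(b))
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def rank_demonstrations(query_script: str, pool: list) -> list:
    """Order candidate in-context examples by similarity to the query script."""
    q = function_profile(query_script)
    return sorted(pool,
                  key=lambda ex: profile_similarity(q, function_profile(ex)),
                  reverse=True)
```

Under this sketch, a query script calling `eval(base64_decode(...))` would rank a demonstration that also invokes `eval` or `system` above a benign script with no critical calls, which is the intended effect of behavior-weighted selection.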