Cross-site scripting (XSS) remains a persistent web security vulnerability, especially because obfuscation can change the surface form of a malicious payload while preserving its behavior. These transformations make it difficult for traditional and machine learning-based detection systems to reliably identify attacks. Existing approaches for generating obfuscated payloads often emphasize syntactic diversity, but they do not always ensure that the generated samples remain behaviorally valid. This paper presents a structured pipeline for generating and evaluating obfuscated XSS payloads using large language models (LLMs). The pipeline combines deterministic transformation techniques with LLM-based generation and uses a browser-based runtime evaluation procedure to compare payload behavior in a controlled execution environment. This allows generated samples to be assessed through observable runtime behavior rather than syntactic similarity alone. In the evaluation, an untuned baseline language model achieves a runtime behavior match rate of 0.15, while fine-tuning on behavior-preserving source-target obfuscation pairs improves the match rate to 0.22. Although this represents a measurable improvement, the results show that current LLMs still struggle to generate obfuscations that preserve observed runtime behavior. A downstream classifier evaluation further shows that adding generated payloads does not improve detection performance in this setting, although behavior-filtered generated samples can be incorporated without materially degrading performance. Overall, the study demonstrates both the promise and the limits of applying generative models to adversarial security data generation and emphasizes the importance of runtime behavior checks in improving the quality of generated data for downstream detection systems.
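To make the reported metric concrete, the following is a minimal sketch of how a runtime behavior match rate could be computed. It assumes, purely for illustration, that each payload's observed behavior has already been recorded as an ordered tuple of event labels and that two payloads match when their traces are identical and non-empty. The abstract does not specify the trace format or the matching rule, so the names `PairResult`, `behavior_matches`, and `match_rate`, as well as the event labels, are hypothetical.

```python
from dataclasses import dataclass
from typing import Sequence, Tuple

# Hypothetical trace format: an ordered tuple of event labels recorded while a
# payload runs in the controlled browser environment. The abstract does not
# define the actual representation.
Trace = Tuple[str, ...]


@dataclass(frozen=True)
class PairResult:
    """Observed traces for one source payload and its generated obfuscation."""
    source_trace: Trace
    generated_trace: Trace


def behavior_matches(result: PairResult) -> bool:
    # Assumption: a match requires the source to show some behavior and the
    # generated payload to reproduce exactly the same sequence of events.
    return bool(result.source_trace) and result.source_trace == result.generated_trace


def match_rate(results: Sequence[PairResult]) -> float:
    # Fraction of source-target pairs whose observed behavior matches.
    if not results:
        return 0.0
    return sum(behavior_matches(r) for r in results) / len(results)


# Illustrative data only; these are not results from the paper.
results = [
    PairResult(("dialog:alert",), ("dialog:alert",)),                 # match
    PairResult(("dialog:alert",), ()),                                # generated payload did nothing
    PairResult(("request:/log",), ("dialog:alert",)),                 # different behavior
    PairResult(("dialog:alert", "request:/log"),
               ("dialog:alert", "request:/log")),                     # match
]

print(match_rate(results))
```

Under this definition, a generated payload that no longer executes counts as a mismatch even if it looks syntactically close to its source, which is the distinction the abstract draws between runtime behavior and syntactic similarity.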