Leveraging Large Language Models for Trustworthiness Assessment of Web Applications

The widespread adoption of web applications has made their security a critical concern and has increased the need for systematic ways to assess whether they can be considered trustworthy. However, "trust" assessment remains an open problem: existing techniques primarily focus on detecting known vulnerabilities or depend on manual evaluation, which limits their scalability. Evaluating adherence to secure coding practices offers a complementary, pragmatic perspective by focusing on observable development behaviors. In practice, however, secure coding practices are identified and verified predominantly by hand, relying on expert knowledge and code reviews, which is time-consuming, subjective, and difficult to scale. This study presents an empirical methodology to automate the trustworthiness assessment of web applications by leveraging Large Language Models (LLMs) to verify adherence to secure coding practices. We conduct a comparative analysis of prompt engineering techniques across five state-of-the-art LLMs, ranging from baseline zero-shot classification to prompts enriched with semantic definitions, structural context derived from call graphs, and explicit instructional guidance. Furthermore, we propose an extension of a hierarchical Quality Model (QM) based on the Logic Score of Preference (LSP), in which LLM outputs populate the model's quality attributes and are aggregated into a holistic trustworthiness score. Experimental results indicate that excessive structural context can introduce noise, whereas rule-based instructional prompting improves assessment reliability. The resulting trustworthiness score discriminates between secure and vulnerable implementations, supporting the feasibility of using LLMs for scalable and context-aware trust assessment.
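
To make the aggregation step concrete, the following is a minimal sketch of how per-practice LLM verdicts could feed one LSP-style aggregation node. It assumes the weighted power mean commonly used in LSP; the practice names, weights, exponent, and the helper names `practice_score` and `weighted_power_mean` are hypothetical illustrations and are not taken from the study's actual Quality Model.

```python
import math
from typing import Sequence


def practice_score(verdicts: Sequence[bool]) -> float:
    """Fraction of checked code units that an LLM judged compliant (assumed mapping)."""
    if not verdicts:
        raise ValueError("at least one verdict is required")
    return sum(1 for v in verdicts if v) / len(verdicts)


def weighted_power_mean(scores: Sequence[float], weights: Sequence[float], r: float) -> float:
    """Aggregate scores in [0, 1] with a weighted power mean of exponent r.

    r = 1 gives the weighted arithmetic mean, r = 0 the weighted geometric mean;
    r < 1 leans conjunctive (low scores are penalized more strongly).
    """
    if not scores or len(scores) != len(weights):
        raise ValueError("scores and weights must be non-empty and of equal length")
    if any(s < 0.0 or s > 1.0 for s in scores):
        raise ValueError("scores must lie in [0, 1]")
    if any(w <= 0.0 for w in weights) or abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights must be positive and sum to 1")
    if r <= 0 and any(s == 0.0 for s in scores):
        # Limit of the power mean when an input is zero and r <= 0.
        return 0.0
    if r == 0:
        return math.exp(sum(w * math.log(s) for s, w in zip(scores, weights)))
    return sum(w * s ** r for s, w in zip(scores, weights)) ** (1.0 / r)


# Hypothetical verdicts: True means the LLM judged the practice to be followed.
llm_verdicts = {
    "input_validation": [True, True, False, True],
    "output_encoding": [True, True, True, True],
    "authentication": [True, False],
}
# Hypothetical weights and exponent; a real model would calibrate these.
weights = [0.4, 0.3, 0.3]
scores = [practice_score(v) for v in llm_verdicts.values()]
trust_score = weighted_power_mean(scores, weights, r=-1.0)
print(f"trustworthiness score: {trust_score:.3f}")
```

In this sketch a negative exponent is chosen so that a single poorly followed practice pulls the overall score down more than an arithmetic mean would; in a hierarchical model, the output of one such node would become an input score to its parent node.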
