Manual vulnerability scoring, such as assigning Common Vulnerability Scoring System (CVSS) scores, is a resource-intensive process that is often influenced by subjective interpretation. This study investigates the potential of general-purpose large language models (LLMs), namely ChatGPT, Llama, Grok, DeepSeek, and Gemini, to automate this process by analyzing over 31{,}000 recent Common Vulnerabilities and Exposures (CVE) entries. The results show that LLMs substantially outperform the baseline on certain metrics (e.g., \textit{Availability Impact}), while offering more modest gains on others (e.g., \textit{Attack Complexity}). Moreover, model performance varies across both LLM families and individual CVSS metrics, with ChatGPT-5 attaining the highest precision. Our analysis reveals that LLMs tend to misclassify many of the same CVEs, and ensemble-based meta-classifiers only marginally improve performance. Further examination shows that CVE descriptions often lack critical context or contain ambiguous phrasing, which contributes to systematic misclassifications. These findings underscore the importance of enhancing vulnerability descriptions and incorporating richer contextual details to support more reliable automated reasoning and alleviate the growing backlog of CVEs awaiting triage.