Manual vulnerability scoring, such as assigning Common Vulnerability Scoring System (CVSS) scores, is a resource-intensive process that is often influenced by subjective interpretation. This study investigates the potential of general-purpose large language models (LLMs), namely ChatGPT, Llama, Grok, DeepSeek, and Gemini, to automate this process by analyzing over 31{,}000 recent Common Vulnerabilities and Exposures (CVE) entries. The results show that LLMs substantially outperform the baseline on certain metrics (e.g., \textit{Availability Impact}), while offering more modest gains on others (e.g., \textit{Attack Complexity}). Moreover, model performance varies across both LLM families and individual CVSS metrics, with ChatGPT-5 attaining the highest precision. Our analysis reveals that LLMs tend to misclassify many of the same CVEs, and ensemble-based meta-classifiers only marginally improve performance. Further examination shows that CVE descriptions often lack critical context or contain ambiguous phrasing, which contributes to systematic misclassifications. These findings underscore the importance of enhancing vulnerability descriptions and incorporating richer contextual details to support more reliable automated reasoning and alleviate the growing backlog of CVEs awaiting triage.