Cyber threat intelligence (CTI) is crucial in today's cybersecurity landscape, providing essential insights for understanding and mitigating ever-evolving cyber threats. The recent rise of Large Language Models (LLMs) has shown potential in this domain, but concerns about their reliability, accuracy, and hallucinations persist. While existing benchmarks provide general evaluations of LLMs, no benchmark addresses the practical and applied aspects of CTI-specific tasks. To bridge this gap, we introduce CTIBench, a benchmark designed to assess LLMs' performance in CTI applications. CTIBench includes multiple datasets focused on evaluating the knowledge LLMs have acquired about the cyber-threat landscape. Our evaluation of several state-of-the-art models on these tasks provides insights into their strengths and weaknesses in CTI contexts, contributing to a better understanding of LLM capabilities in CTI.