Analyzing Open Source Intelligence (OSINT) from large volumes of data is critical for drafting and publishing comprehensive Cyber Threat Intelligence (CTI) reports. This process typically follows a three-stage workflow: triage, deep search, and TI drafting. While Large Language Models (LLMs) offer a promising route toward automating it, existing benchmarks fall short in several ways. They often consist of tasks that do not reflect real-world analyst workflows; human analysts, for example, rarely receive tasks as multiple-choice questions. They also tend to rely on model-centric metrics that reward lexical overlap rather than the actionable, detailed insights security analysts need. Moreover, they typically fail to cover the complete three-stage workflow. To address these issues, we introduce CyberThreat-Eval, an expert-annotated benchmark collected from the daily CTI workflow of a world-leading company. It assesses LLMs on practical tasks across all three stages, using analyst-centric metrics that measure factual accuracy, content quality, and operational cost. Our evaluation with this benchmark reveals important limitations of current LLMs: for example, they often lack the nuanced expertise required to handle complex details and struggle to distinguish correct from incorrect information. To address these challenges, the CTI workflow incorporates both external ground-truth databases and human expert knowledge, and TRA allows human experts to iteratively provide feedback for continuous improvement. The code is available at \href{https://github.com/xschen-beb/CyberThreat-Eval}{\texttt{GitHub}} and \href{https://huggingface.co/datasets/xse/CyberThreat-Eval}{\texttt{HuggingFace}}.
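For reference, the dataset linked above can be loaded with the HuggingFace \texttt{datasets} library. The following is a minimal sketch, assuming the dataset is publicly accessible; the split and field layout shown is illustrative, not specified in this abstract:

\begin{verbatim}
# Minimal sketch: load CyberThreat-Eval from HuggingFace.
# Assumption: the dataset at xse/CyberThreat-Eval is public;
# split names and example fields are printed, not hard-coded.
from datasets import load_dataset

ds = load_dataset("xse/CyberThreat-Eval")
print(ds)                      # available splits and their sizes
first_split = next(iter(ds))   # name of the first split
print(ds[first_split][0])      # inspect one annotated example
\end{verbatim}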