CTI-REALM (Cyber Threat Real World Evaluation and LLM Benchmarking) is a benchmark designed to evaluate AI agents' ability to interpret cyber threat intelligence (CTI) and develop detection rules. The benchmark provides a realistic environment that replicates the security analyst workflow, enabling agents to examine CTI reports, execute queries, understand schema structures, and construct detection rules. Evaluation involves emulated attacks of varying complexity across Linux systems, cloud platforms, and Azure Kubernetes Service (AKS), with ground truth data for accurate assessment. Agent performance is measured through both final detection results and trajectory-based rewards that capture decision-making effectiveness. This work demonstrates the potential of AI agents to support labor-intensive aspects of detection engineering. Our comprehensive evaluation of 16 frontier models shows that Claude Opus 4.6 (High) achieves the highest overall reward (0.637), followed by Claude Opus 4.5 (0.624) and the GPT-5 family. An ablation study confirms that CTI-specific tools significantly improve agent performance, and a variance analysis across repeated runs demonstrates result stability. Finally, a memory augmentation study shows that seeded context can close 33% of the performance gap between smaller and larger models.
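The abstract does not say how the "performance gap closed" figure is computed. One common reading, assumed here, is the share of the reward difference between a smaller and a larger model that the smaller model recovers once it is given seeded context. The sketch below shows only that arithmetic; the function name `gap_closed` and all reward values are hypothetical and are not taken from the benchmark.

```python
def gap_closed(small: float, small_augmented: float, large: float) -> float:
    """Fraction of the small-to-large reward gap recovered by augmentation.

    Assumed definition (not confirmed by the source):
        (small_augmented - small) / (large - small)
    """
    gap = large - small
    if gap <= 0:
        raise ValueError("large-model reward must exceed small-model reward")
    return (small_augmented - small) / gap


# Hypothetical rewards, chosen only to illustrate the calculation.
fraction = gap_closed(small=0.40, small_augmented=0.50, large=0.60)
print(f"{fraction:.0%} of the gap closed")
```

Under this definition, a value of 0 means augmentation did not help and a value of 1 means the smaller model matched the larger one.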