TSTEM: A Cognitive Platform for Collecting Cyber Threat Intelligence in the Wild

Extracting cyber threat intelligence (CTI) from open sources is a rapidly expanding defensive strategy that strengthens the resilience of both Information Technology (IT) and Operational Technology (OT) environments against large-scale cyber-attacks. While previous research has focused on improving individual components of the extraction process, the community lacks open-source platforms for deploying streaming CTI data pipelines in the wild. To address this gap, this study describes the implementation of an efficient platform that runs compute-intensive data pipelines on cloud infrastructure to detect, collect, and share CTI from different online sources in real time. We developed a prototype platform, TSTEM, a containerized microservice architecture that uses Tweepy, Scrapy, Terraform, ELK, Kafka, and MLOps to autonomously search for, extract, and index indicators of compromise (IOCs) in the wild. The provisioning, monitoring, and management of the TSTEM platform are handled through infrastructure as code (IaC). Custom focused crawlers collect web content, which a first-level classifier then screens for potential IOCs. Content deemed relevant advances to a second level of extraction for further examination. Throughout this process, state-of-the-art NLP models perform classification and entity extraction. Our experimental results indicate that these models achieve high accuracy (exceeding 98%) in the classification and extraction tasks, and do so in less than a minute. We attribute the effectiveness of the system to a fine-tuned, multi-stage IOC extraction method that identifies relevant information precisely and with few false positives.
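
The two-level flow described above (screen content for relevance first, extract IOCs only from content that passes) can be illustrated with a minimal Python sketch. This is not the TSTEM implementation: the platform uses NLP models for both levels, whereas the sketch substitutes a keyword check and regular expressions so that it stays self-contained. The names `SECURITY_TERMS`, `IOC_PATTERNS`, `is_relevant`, `extract_iocs`, and `process` are hypothetical and chosen for illustration only.

```python
import re
from typing import Dict, List

# ASSUMPTION: a small keyword set stands in for the first-level NLP classifier.
SECURITY_TERMS = {"malware", "ransomware", "phishing", "c2", "exploit", "botnet", "trojan"}

# ASSUMPTION: regular expressions stand in for the second-level entity extractor.
IOC_PATTERNS = {
    "ipv4": re.compile(
        r"\b(?:(?:25[0-5]|2[0-4]\d|1?\d?\d)\.){3}(?:25[0-5]|2[0-4]\d|1?\d?\d)\b"
    ),
    "md5": re.compile(r"\b[a-fA-F0-9]{32}\b"),
    "sha256": re.compile(r"\b[a-fA-F0-9]{64}\b"),
    "cve": re.compile(r"\bCVE-\d{4}-\d{4,}\b", re.IGNORECASE),
}


def is_relevant(text: str, min_hits: int = 1) -> bool:
    """First level: cheap relevance gate based on security keywords."""
    tokens = set(re.findall(r"[a-z0-9]+", text.lower()))
    return len(tokens & SECURITY_TERMS) >= min_hits


def extract_iocs(text: str) -> Dict[str, List[str]]:
    """Second level: collect unique, sorted matches for each IOC type."""
    found: Dict[str, List[str]] = {}
    for name, pattern in IOC_PATTERNS.items():
        matches = sorted(set(pattern.findall(text)))
        if matches:
            found[name] = matches
    return found


def process(text: str) -> Dict[str, List[str]]:
    """Run extraction only on content that passes the first-level gate."""
    if not is_relevant(text):
        return {}
    return extract_iocs(text)


if __name__ == "__main__":
    report = "New ransomware campaign uses C2 server 203.0.113.7 and targets CVE-2021-44228."
    print(process(report))
```

Gating the extractor behind a cheap first pass is what keeps false positives low in this sketch: an address-like string in unrelated content never reaches the second level.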
