Cybersecurity Threat Hunting and Vulnerability Analysis Using a Neo4j Graph Database of Open Source Intelligence

Open source intelligence is a powerful tool for cybersecurity analysts, both for analyzing discovered vulnerabilities and for detecting novel cybersecurity threats and exploits. However, the volume of security-relevant information on the internet is always increasing, and analysts cannot feasibly parse it comprehensively. Methods that condense the available open source intelligence and automatically develop connections between disparate sources of information are therefore highly valuable. In this research, we present a system that constructs a Neo4j graph database formed by shared connections between open source intelligence texts, including blogs, cybersecurity bulletins, news sites, antivirus scans, social media posts (e.g., Reddit and Twitter), and threat reports. These connections consist of possible indicators of compromise (IoCs; e.g., IP addresses, domains, hashes, email addresses, phone numbers), information on known exploits and techniques (e.g., CVEs and MITRE ATT&CK Technique IDs), and potential sources of information on cybersecurity exploits, such as Twitter usernames. The construction of the database of potential IoCs is detailed, including the addition of machine learning and metadata that can be used to filter the data for a specific domain (for example, a specific natural language) when needed. We show examples of using the graph database to query connections between known malicious IoCs and open source intelligence documents, including threat reports. We show that this type of relationship querying can allow for more effective use of open source intelligence for threat hunting, malware family clustering, and vulnerability analysis.
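As a rough illustration of the idea described above, the following Python sketch extracts a few indicator types from short texts and links documents through the indicators they share. It is not the system presented in this work: the regular expressions are deliberately simplified, the sample documents and identifiers are invented, the graph is held in in-memory dictionaries rather than Neo4j, and the `Document`, `Indicator`, and `MENTIONS` names in the Cypher string are a hypothetical schema chosen only for this example.

```python
import re
from collections import defaultdict

# Simplified, illustrative patterns (assumptions, not the paper's extractors).
IOC_PATTERNS = {
    "ipv4": re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b"),
    "cve": re.compile(r"\bCVE-\d{4}-\d{4,}\b", re.IGNORECASE),
    "attack_technique": re.compile(r"\bT\d{4}(?:\.\d{3})?\b"),
    "sha256": re.compile(r"\b[a-fA-F0-9]{64}\b"),
}


def normalize(kind, value):
    """Canonicalize values so the same indicator maps to a single node."""
    if kind == "cve":
        return value.upper()
    if kind == "sha256":
        return value.lower()
    return value


def extract_indicators(text):
    """Return the set of (kind, value) indicator pairs found in one text."""
    found = set()
    for kind, pattern in IOC_PATTERNS.items():
        for match in pattern.findall(text):
            found.add((kind, normalize(kind, match)))
    return found


def build_graph(documents):
    """Build a bipartite document <-> indicator graph as two adjacency maps."""
    doc_to_iocs = {}
    ioc_to_docs = defaultdict(set)
    for doc_id, text in documents.items():
        indicators = extract_indicators(text)
        doc_to_iocs[doc_id] = indicators
        for indicator in indicators:
            ioc_to_docs[indicator].add(doc_id)
    return doc_to_iocs, ioc_to_docs


def documents_sharing_indicators(doc_id, doc_to_iocs, ioc_to_docs):
    """Return the other documents that share at least one indicator with doc_id."""
    related = set()
    for indicator in doc_to_iocs[doc_id]:
        related |= ioc_to_docs[indicator]
    related.discard(doc_id)
    return related


# Hypothetical Cypher equivalent of the one-hop lookup, assuming a schema of
# (:Document)-[:MENTIONS]->(:Indicator); the labels are illustrative only.
CYPHER_DOCS_FOR_INDICATOR = (
    "MATCH (d:Document)-[:MENTIONS]->(i:Indicator {value: $value}) "
    "RETURN d.id"
)

# Invented sample texts; the IP address is from a documentation-only range.
DOCS = {
    "blog-1": "The dropper contacts 203.0.113.7 and exploits CVE-2021-44228.",
    "tweet-1": "Seeing scans from 203.0.113.7 today, likely T1190 activity.",
    "report-1": "Patch guidance for cve-2021-44228 is available.",
}

doc_to_iocs, ioc_to_docs = build_graph(DOCS)
known_bad = ("ipv4", "203.0.113.7")
print(sorted(ioc_to_docs[known_bad]))
print(sorted(documents_sharing_indicators("blog-1", doc_to_iocs, ioc_to_docs)))
```

Starting from a known malicious indicator, the first lookup returns the documents that mention it, and the second follows shared indicators one step further to find related documents. This two-hop pattern is the kind of relationship query that a graph database expresses directly.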
