Forensic Memory Analysis (FMA) and Virtual Machine Introspection (VMI) are critical tools for security in virtualization-based environments. Both involve applying digital forensic methods to extract information from a system in order to identify and explain security incidents. A key challenge shared by FMA and VMI is the "semantic gap": the difficulty of interpreting raw memory data without specialized tools and expertise. In this work, we investigate how a priori knowledge, metadata, and engineered features can aid VMI and FMA, leveraging machine learning to automate information extraction and reduce the workload of forensic investigators. We choose OpenSSH as our use case for testing different methods of extracting high-level structures. We also test our methods on complete physical memory dumps to demonstrate the effectiveness of the engineered features, which range from basic statistical measures to graph-based representations built from malloc headers and pointer translations. Training and testing are carried out on public datasets, and we compare our results against recognized baseline methods. We show that metadata improves algorithm performance when very little training data is available, and we quantify how additional data yields better generalization. Our final contribution is an open dataset of physical memory dumps totalling more than 1 TB, covering different memory states, software environments, main-memory capacities, and operating system versions. Our experiments show that richer metadata boosts performance, with all methods achieving an F1-score above 80%. This research underscores the potential of feature engineering and machine learning to bridge the semantic gap.
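To make the graph-based feature idea concrete, the following is a minimal, hypothetical sketch (not the paper's actual pipeline) of how chunks in a raw heap snapshot might be enumerated via glibc-style malloc headers and then linked into a graph wherever a chunk's payload holds a pointer-like value landing inside another chunk. The 8-byte little-endian size word with three low flag bits follows the glibc malloc convention; the heap bytes in the usage example are synthetic.

```python
import struct

WORD = 8  # word size on x86-64

def walk_chunks(heap, base):
    """Yield (address, size) per chunk by reading the size word that sits one
    word into each chunk header; the low 3 bits of that word are flags."""
    chunks = []
    off = 0
    while off + 2 * WORD <= len(heap):
        size_word = struct.unpack_from("<Q", heap, off + WORD)[0]
        size = size_word & ~0x7  # mask PREV_INUSE / IS_MMAPPED / NON_MAIN_ARENA
        if size == 0 or off + size > len(heap):
            break  # corrupt or truncated chunk ends the walk
        chunks.append((base + off, size))
        off += size
    return chunks

def pointer_graph(heap, base, chunks):
    """Add an edge chunk A -> chunk B when a word in A's payload is an address
    that falls inside B's span (a crude pointer-translation heuristic)."""
    spans = [(addr, addr + size) for addr, size in chunks]
    graph = {addr: set() for addr, _ in chunks}
    for addr, size in chunks:
        start = addr - base + 2 * WORD          # skip the chunk header
        end = addr - base + size - WORD + 1
        for i in range(start, end, WORD):
            val = struct.unpack_from("<Q", heap, i)[0]
            for lo, hi in spans:
                if lo <= val < hi and lo != addr:
                    graph[addr].add(lo)
    return graph

# Synthetic two-chunk heap: chunk 1's payload points back into chunk 0.
base = 0x1000
heap = bytearray(0x80)
struct.pack_into("<Q", heap, 0x08, 0x41)    # chunk 0: size 0x40, PREV_INUSE set
struct.pack_into("<Q", heap, 0x48, 0x41)    # chunk 1: size 0x40, PREV_INUSE set
struct.pack_into("<Q", heap, 0x50, 0x1010)  # pointer into chunk 0's payload

chunks = walk_chunks(bytes(heap), base)
graph = pointer_graph(bytes(heap), base, chunks)
```

From a graph like this, structural features (out-degree, connected components, reachability) could then be fed to a classifier; real memory dumps would of course need virtual-to-physical address translation first.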