Digital forensic investigations increasingly depend on preprocessing heterogeneous network evidence from intrusion detection systems, IoT devices, and enterprise traffic logs. Incompatible schemas and timestamp formats hinder evidence correlation and timeline reconstruction, and current ad hoc approaches offer no mechanism for verifying that output is consistent across runs or analyses. The resulting reproducibility gaps challenge evidence admissibility. This paper introduces a deterministic forensic preprocessing framework that converts heterogeneous network datasets into a reproducible canonical form. The framework formalises three preprocessing transformations: schema normalisation, temporal normalisation, and provenance tracking. These transformations are specified with set-theoretic definitions and supported by four theorems establishing determinism, information preservation, and provenance completeness. A chunk-based architecture keeps memory use bounded at O(c). Empirical evaluation on UNSW-NB15, IoT-23, and TON_IoT demonstrates 100% output consistency across repeated runs, robust temporal normalisation completeness over heterogeneous timestamp formats, and performance that scales from millions to hundreds of millions of records.
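To make the three transformations concrete, the following is a minimal Python sketch of chunked, deterministic preprocessing. It is an illustration under stated assumptions, not the framework's implementation: the `FIELD_MAP` mapping, the source name `source_a`, the column names, the function names, and the choice of SHA-256 over canonical JSON as a consistency check are all hypothetical. Timestamps without a timezone are assumed to be UTC.

```python
import csv
import hashlib
import json
from datetime import datetime, timezone
from itertools import islice

# Hypothetical mapping from one source's column names to canonical names.
FIELD_MAP = {
    "source_a": {"src": "src_ip", "dst": "dst_ip", "start": "timestamp"},
}


def normalise_timestamp(value: str) -> str:
    """Temporal normalisation: epoch seconds or ISO 8601 -> UTC ISO 8601."""
    try:
        dt = datetime.fromtimestamp(float(value), tz=timezone.utc)
    except ValueError:
        dt = datetime.fromisoformat(value)
        if dt.tzinfo is None:
            # Assumption: naive timestamps are already in UTC.
            dt = dt.replace(tzinfo=timezone.utc)
    return dt.astimezone(timezone.utc).isoformat()


def canonicalise(row: dict, source: str, index: int) -> dict:
    """Schema normalisation plus a provenance entry for one raw row."""
    record = {canon: row[raw] for raw, canon in FIELD_MAP[source].items()}
    record["timestamp"] = normalise_timestamp(record["timestamp"])
    raw_bytes = json.dumps(row, sort_keys=True).encode("utf-8")
    record["provenance"] = {
        "source": source,
        "row": index,
        "raw_sha256": hashlib.sha256(raw_bytes).hexdigest(),
    }
    return record


def preprocess(lines, source: str, chunk_size: int = 2):
    """Process rows chunk by chunk; only one chunk is held in memory."""
    reader = csv.DictReader(lines)
    digest = hashlib.sha256()
    count = 0
    while True:
        chunk = list(islice(reader, chunk_size))
        if not chunk:
            break
        for row in chunk:
            record = canonicalise(row, source, count)
            # Sorted keys and fixed separators give a byte-stable encoding.
            encoded = json.dumps(record, sort_keys=True, separators=(",", ":"))
            digest.update(encoded.encode("utf-8") + b"\n")
            count += 1
    return count, digest.hexdigest()


SAMPLE = [
    "src,dst,start\n",
    "10.0.0.1,10.0.0.2,0\n",
    "10.0.0.3,10.0.0.4,2015-01-22T11:50:14+02:00\n",
    "10.0.0.5,10.0.0.6,1421927414\n",
]

if __name__ == "__main__":
    first = preprocess(SAMPLE, "source_a", chunk_size=2)
    second = preprocess(SAMPLE, "source_a", chunk_size=3)
    print(first == second)
```

In this sketch, comparing the digest across repeated runs, or across different chunk sizes, is one simple way to check output consistency: the canonical output does not depend on how the input is chunked, so the digests should match.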