Correctly identifying the type of a file under examination is a critical part of a forensic investigation. The file type alone indicates the kind of content a file holds, such as a picture, video, manuscript, or spreadsheet. When a system owner wants to keep files inaccessible or conceal their type, we propose using an adversarially trained neural network to determine a file's true type, even if the extension or file header has been obfuscated to complicate discovery. Our semi-supervised generative adversarial network (SGAN) achieved 97.6% accuracy in classifying files across 11 different types. We also compared our network against a traditional standalone neural network and three other machine learning algorithms. The adversarially trained network proved to be the most precise file classifier, especially in scenarios with few supervised samples available. Our implementation of a file classifier using an SGAN is available on GitHub (https://ksaintg.github.io/SGAN-File-Classier).