File fragment classification (FFC) on small chunks of memory is essential in memory forensics and Internet security. Existing methods mainly treat file fragments as 1D byte signals and classify them using inter-byte features, while the bit-level information within bytes, i.e., intra-byte information, is seldom considered. This makes them inherently ill-suited to classifying variable-length coding files, whose symbols are represented by a variable number of bits. To address this, we propose Byte2Image, a novel data augmentation technique that introduces the neglected intra-byte information into file fragments and re-treats them as 2D grayscale images, allowing both inter-byte and intra-byte correlations to be captured simultaneously by powerful convolutional neural networks (CNNs). Specifically, to convert file fragments to 2D images, we employ a sliding byte window to expose the neglected intra-byte information and stack the resulting n-gram features row by row. We further propose a byte sequence & image fusion network as a classifier, which jointly models the raw 1D byte sequence and the converted 2D image to perform FFC. Experiments on the FFT-75 dataset validate that our proposed method achieves notable accuracy improvements over state-of-the-art methods in nearly all scenarios. The code will be released at https://github.com/wenyang001/Byte2Image.
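The fragment-to-image conversion can be illustrated with a minimal sketch. It is one plausible reading of the sliding-window idea, not the released Byte2Image implementation: the function name `fragment_to_image`, the 2-gram (two-byte) context, the 1-bit stride, and the `(8, length - 1)` output layout are all assumptions made for illustration. Each row holds the 8-bit values seen when the byte window starts at a given bit offset, so row 0 reproduces the original bytes and rows 1 to 7 expose values that straddle byte boundaries.

```python
import numpy as np


def fragment_to_image(fragment: bytes) -> np.ndarray:
    """Hypothetical helper: turn a byte fragment into a 2D grayscale array.

    Assumption: an 8-bit window slides over each pair of adjacent bytes
    (a 2-gram) with a 1-bit stride, giving 8 bit offsets per byte position.
    """
    data = np.frombuffer(fragment, dtype=np.uint8)
    if data.size < 2:
        raise ValueError("fragment must contain at least two bytes")

    # Concatenate each byte with its successor into one 16-bit value.
    pairs = (data[:-1].astype(np.uint16) << 8) | data[1:].astype(np.uint16)

    rows = []
    for offset in range(8):
        # Take the 8 bits that start `offset` bits into the current byte.
        window = (pairs >> (8 - offset)) & 0xFF
        rows.append(window.astype(np.uint8))

    # One row per bit offset, one column per byte position.
    return np.stack(rows, axis=0)


image = fragment_to_image(bytes([0xAB, 0xCD, 0xEF]))
print(image.shape)
```

Under these assumptions, a 4096-byte fragment would yield an 8 × 4095 array that a 2D CNN can consume directly, while the untouched byte sequence remains available for a separate 1D branch.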