A Robust Approach Towards Distinguishing Natural and Computer Generated Images using Multi-Colorspace fused and Enriched Vision Transformer

Most published methods for classifying natural and computer-generated images are designed as binary tasks that consider either natural images versus computer graphics images or natural images versus GAN-generated images, but not natural images versus both classes of generated images. Moreover, although recent convolutional neural networks and transformer-based architectures achieve remarkable accuracy on this forensic classification task, they tend to fail on images that have undergone post-processing operations commonly applied to deceive forensic algorithms, such as JPEG compression and added Gaussian noise. This work proposes a robust approach for distinguishing natural images from computer-generated images, including both computer graphics and GAN-generated images, using a fusion of two vision transformers that operate in different color spaces, one in RGB and the other in YCbCr. The proposed approach achieves a high performance gain over a set of baselines, along with higher robustness and generalizability. When visualized, the features of the proposed model show higher class separability than both the input image features and the baseline features. This work also studies attention map visualizations of the networks in the fused model and observes that the proposed methodology captures more image information relevant to the forensic task of classifying natural and generated images.
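
To make the two-color-space idea concrete, the following is a minimal sketch, not the authors' implementation. It assumes the full-range JFIF/BT.601 RGB-to-YCbCr conversion, uses a per-channel mean as a hypothetical stand-in for each vision transformer branch, and assumes the two branches are fused by concatenating their feature vectors; the abstract specifies none of these details. All function names are illustrative.

```python
from typing import Callable, List, Tuple

Pixel = Tuple[float, float, float]
Image = List[List[Pixel]]  # rows of (R, G, B) pixels with values in 0..255


def rgb_to_ycbcr(pixel: Pixel) -> Pixel:
    """Convert one RGB pixel to YCbCr (full-range JFIF/BT.601; an assumption)."""
    r, g, b = pixel
    y = 0.299 * r + 0.587 * g + 0.114 * b
    cb = 128.0 - 0.168736 * r - 0.331264 * g + 0.5 * b
    cr = 128.0 + 0.5 * r - 0.418688 * g - 0.081312 * b
    return (y, cb, cr)


def convert_image(image: Image) -> Image:
    """Apply the pixel-wise conversion to every pixel of the image."""
    return [[rgb_to_ycbcr(p) for p in row] for row in image]


def channel_means(image: Image) -> List[float]:
    """Hypothetical stand-in for a transformer branch: per-channel mean."""
    pixels = [p for row in image for p in row]
    return [sum(p[c] for p in pixels) / len(pixels) for c in range(3)]


def fused_features(
    rgb_image: Image,
    rgb_encoder: Callable[[Image], List[float]] = channel_means,
    ycbcr_encoder: Callable[[Image], List[float]] = channel_means,
) -> List[float]:
    """Encode the image in both color spaces and concatenate the features.

    Concatenation is an assumed fusion strategy, chosen only for illustration.
    """
    rgb_feats = rgb_encoder(rgb_image)
    ycbcr_feats = ycbcr_encoder(convert_image(rgb_image))
    return list(rgb_feats) + list(ycbcr_feats)


if __name__ == "__main__":
    tiny = [[(255.0, 255.0, 255.0), (0.0, 0.0, 0.0)]]  # one white, one black pixel
    print(fused_features(tiny))
```

In a real pipeline, each encoder argument would be replaced by a trained vision transformer producing an embedding, and a classifier head would operate on the fused vector.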
