CNNs for JPEGs: A Study in Computational Cost

Convolutional neural networks (CNNs) have achieved astonishing advances over the past decade, defining the state of the art in several computer vision tasks. CNNs are capable of learning robust representations of the data directly from RGB pixels. However, most image data are stored in compressed formats, of which JPEG is the most widely used because of transmission and storage constraints, and such images require a preliminary decoding step with a high computational load and memory usage. For this reason, deep learning methods capable of learning directly from the compressed domain have been gaining attention in recent years. These methods usually extract a frequency-domain representation of the image, such as the DCT coefficients, through partial decoding, and then adapt typical CNN architectures to work with it. One limitation of current works is that, in order to accommodate the frequency-domain data, the modifications made to the original model significantly increase its number of parameters and its computational complexity. On the one hand, these methods have faster preprocessing, since the cost of fully decoding the images is avoided; on the other hand, the cost of passing the images through the model is increased, diminishing the potential speedup. In this paper, we present a further study of the computational cost of deep models designed for the frequency domain, evaluating the cost of both decoding the images and passing them through the network. We also propose handcrafted and data-driven techniques for reducing the computational complexity and the number of parameters of these models in order to keep them similar to their RGB baselines, leading to efficient models with a better trade-off between computational cost and accuracy.
