End-to-End Learnable Multi-Scale Feature Compression for VCM

The proliferation of deep learning-based machine vision applications has given rise to a new type of compression, so-called video coding for machines (VCM). VCM differs from traditional video coding in that it is optimized for machine vision performance rather than human visual quality. In the feature compression track of MPEG-VCM, the multi-scale features extracted from images are the data to be compressed. Recent work on feature compression has shown that an approach based on the Versatile Video Coding (VVC) standard can achieve a BD-rate reduction of up to 96% against the MPEG-VCM feature anchor. However, this approach is still sub-optimal because VVC was designed for natural images, not for extracted features. Moreover, the high encoding complexity of VVC makes it difficult to design a lightweight encoder without sacrificing performance. To address these challenges, we propose a novel multi-scale feature compression method that enables both end-to-end optimization on the extracted features and the design of lightweight encoders. The proposed model combines a learnable compressor with a multi-scale feature fusion network so that the redundancy in the multi-scale features is effectively removed. Instead of simply cascading the fusion network and the compression network, we interleave the fusion and encoding processes. Our model first encodes a larger-scale feature to obtain a latent representation and then fuses that latent with a smaller-scale feature. This process is repeated until the smallest-scale feature has been fused, and the encoded latent at the final stage is then entropy-coded for transmission. The results show that our model achieves at least a 52% BD-rate reduction over previous approaches and a $5\times$ to $27\times$ shorter encoding time for object detection. Notably, our model attains near-lossless task performance with only 0.002% to 0.003% of the uncompressed feature data size.
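The interleaved encode-then-fuse order described in the abstract can be illustrated with a minimal sketch. It is not the authors' network: fixed 2x average pooling stands in for a learned encoder stage, element-wise averaging stands in for a learned fusion block, the features are single-channel maps that halve in size at each scale, and entropy coding is omitted. All function names (`downsample2x`, `fuse`, `interleaved_encode`, `make_map`) are hypothetical.

```python
# Illustrative sketch only. Assumptions (not from the paper):
#   - each feature is a single-channel square map stored as a list of lists
#   - each scale is exactly half the height and width of the previous one
#   - average pooling replaces a learned encoder stage
#   - element-wise averaging replaces a learned fusion block


def downsample2x(fmap):
    """Halve height and width by averaging each 2x2 block (stand-in encoder stage)."""
    h, w = len(fmap), len(fmap[0])
    return [
        [
            (fmap[i][j] + fmap[i][j + 1] + fmap[i + 1][j] + fmap[i + 1][j + 1]) / 4.0
            for j in range(0, w, 2)
        ]
        for i in range(0, h, 2)
    ]


def fuse(latent, feature):
    """Merge a latent with a same-sized feature map (stand-in fusion block)."""
    assert len(latent) == len(feature) and len(latent[0]) == len(feature[0])
    return [
        [(a + b) / 2.0 for a, b in zip(row_l, row_f)]
        for row_l, row_f in zip(latent, feature)
    ]


def interleaved_encode(features):
    """Encode and fuse in alternation.

    `features` is ordered from the largest scale to the smallest scale.
    """
    latent = features[0]
    for smaller in features[1:]:
        latent = downsample2x(latent)   # encode the current latent down one scale
        latent = fuse(latent, smaller)  # fuse it with the next smaller-scale feature
    # In the described pipeline, this final latent would be entropy-coded.
    return latent


def make_map(size, value):
    """Build a constant size x size map for the toy example."""
    return [[float(value)] * size for _ in range(size)]


# Toy four-level pyramid: 16x16, 8x8, 4x4, 2x2.
pyramid = [make_map(16, 1), make_map(8, 3), make_map(4, 5), make_map(2, 7)]
final_latent = interleaved_encode(pyramid)
```

In this toy setup only one latent, at the smallest spatial resolution, remains at the end, which is the point of the interleaved design: a single compact representation carries information from every scale instead of one bitstream per scale.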
