The accelerating proliferation of visual content and the rapid development of machine vision technologies pose significant challenges for delivering visual data at massive scale, since the data must be represented effectively to satisfy both human and machine requirements. In this work, we investigate how hierarchical representations derived from an advanced generative prior facilitate the construction of an efficient scalable coding paradigm for human-machine collaborative vision. Our key insight is that, by exploiting the StyleGAN prior, we can learn three-layered representations that encode hierarchical semantics. These representations are organized into basic, middle, and enhanced layers, which progressively support machine intelligence and human visual perception. To achieve efficient compression, we propose a layer-wise scalable entropy transformer that reduces the redundancy between layers. Using a multi-task scalable rate-distortion objective, the proposed scheme is jointly optimized for machine analysis performance, human perceptual quality, and compression ratio. We validate the feasibility of the proposed paradigm on face image compression. Extensive qualitative and quantitative experimental results demonstrate that the proposed paradigm outperforms the latest compression standard, Versatile Video Coding (VVC), in both machine analysis and human perception at extremely low bitrates ($<0.01$ bpp), offering new insights for human-machine collaborative compression.
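To give a sense of what the quoted operating point of below 0.01 bpp implies, the following minimal sketch converts a bits-per-pixel rate into a total bit budget and back. The 1024x1024 resolution is an assumption chosen purely for illustration; the abstract does not state the image size used in the experiments, and the helper names `bit_budget` and `bits_per_pixel` are hypothetical.

```python
def bit_budget(width: int, height: int, bpp: float) -> float:
    """Total number of bits available for a width x height image at a given bpp."""
    return width * height * bpp


def bits_per_pixel(num_bytes: int, width: int, height: int) -> float:
    """Bits per pixel of a compressed file of num_bytes bytes for a width x height image."""
    return 8 * num_bytes / (width * height)


# Assumed resolution for illustration only (not taken from the abstract).
WIDTH, HEIGHT = 1024, 1024

budget_bits = bit_budget(WIDTH, HEIGHT, 0.01)
budget_bytes = budget_bits / 8
print(f"0.01 bpp at {WIDTH}x{HEIGHT}: {budget_bits:.2f} bits (~{budget_bytes:.0f} bytes)")

# A 1310-byte bitstream for the same image stays just under the 0.01 bpp threshold.
print(f"1310 bytes -> {bits_per_pixel(1310, WIDTH, HEIGHT):.6f} bpp")
```

Under this assumed resolution, the entire image, across all three layers, would have to fit in roughly 1.3 KB, which illustrates why redundancy between layers matters at such rates.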