Traditional image codecs emphasize signal fidelity and human perception, often at the expense of machine vision tasks. Deep learning methods have demonstrated promising coding performance by utilizing rich semantic embeddings optimized for both human and machine vision. However, these compact embeddings struggle to capture fine details such as contours and textures, resulting in imperfect reconstructions. Furthermore, existing learning-based codecs lack scalability. To address these limitations, this paper introduces a content-adaptive diffusion model for scalable image compression. The proposed method encodes fine textures through a diffusion process, enhancing perceptual quality while preserving essential features for machine vision tasks. The approach employs a Markov palette diffusion model combined with widely used feature extractors and image generators, enabling efficient data compression. By leveraging collaborative texture-semantic feature extraction and pseudo-label generation, the method accurately captures texture information. A content-adaptive Markov palette diffusion model is then applied to represent both low-level textures and high-level semantic content in a scalable manner. This framework offers flexible control over compression ratios by selecting intermediate diffusion states, eliminating the need for retraining deep learning models at different operating points. Extensive experiments demonstrate the effectiveness of the proposed framework in both image reconstruction and downstream machine vision tasks such as object detection, segmentation, and facial landmark detection, achieving superior perceptual quality compared to state-of-the-art methods.
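The abstract's claim that compression ratio can be controlled by selecting intermediate diffusion states, without retraining, can be illustrated with a minimal sketch of a Markov palette (discrete) diffusion forward process. This is an assumption-laden toy: the palette size `K`, the uniform-resampling transition, and the `betas` schedule are all illustrative stand-ins, not the paper's actual formulation.

```python
import numpy as np

# Toy Markov palette diffusion forward process (illustrative only, not the
# paper's model): each pixel holds a discrete palette index, and each noising
# step either keeps the index or resamples it uniformly from the palette.

K = 8                                # hypothetical palette size
rng = np.random.default_rng(0)

def forward_step(x, beta):
    """One Markov noising step: keep each palette index with probability
    1 - beta, otherwise resample it uniformly from the K-entry palette."""
    resample = rng.random(x.shape) < beta
    return np.where(resample, rng.integers(0, K, size=x.shape), x)

# Toy "image": a 16x16 grid of palette indices.
x0 = rng.integers(0, K, size=(16, 16))

# Running t steps yields progressively coarser states x_t. An encoder can
# pick an intermediate t as its operating point -- more steps discard more
# texture detail, lowering the rate -- with no retraining at any point.
betas = np.linspace(0.02, 0.2, 10)   # illustrative noise schedule
states = [x0]
for beta in betas:
    states.append(forward_step(states[-1], beta))

# Agreement with the clean image shrinks (in expectation) as t grows,
# which is what makes t a usable rate-control knob.
agreement = [float(np.mean(s == x0)) for s in states]
print(agreement[0], agreement[-1])
```

The design point being sketched: because every intermediate state of the Markov chain is a valid (coarser) representation, the codec exposes a family of operating points from a single trained model rather than one model per target bitrate.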