Recent advances in text-to-image generative models make it possible to generate high-quality images from short text descriptions. When pre-trained on billion-scale datasets, these foundation models are effective for a variety of downstream tasks with little or no further training. A natural question is how such models can be adapted for image compression. We investigate several techniques in which pre-trained models are used directly to implement compression schemes targeting novel low-rate regimes. We show how text descriptions can be combined with side information to generate high-fidelity reconstructions that preserve both the semantics and the spatial structure of the original image. We demonstrate that, at very low bitrates, our method can significantly improve upon learned compressors in terms of perceptual and semantic fidelity, despite using no end-to-end training.