Modern visual generative models acquire rich visual knowledge through large-scale training, yet existing visual representations (such as pixels, latents, or tokens) remain external to the model and cannot directly exploit this knowledge for compact storage or reuse. In this work, we introduce a new visual representation framework that encodes a signal as a function parametrized by low-rank adaptations attached to a frozen visual generative model. Such an implicit representation of a visual signal, \textit{e.g.}, an 81-frame video, can further be hashed into a single compact vector, achieving strong perceptual video compression at extremely low bitrates. Beyond basic compression, the functional nature of this representation enables inference-time scaling and control, allowing further refinement of compression performance. More broadly, because the implicit representation acts directly as a function of the generation process, it suggests a unified framework bridging visual compression and generation.