We present PFP, a neural network architecture that compresses long videos into short contexts, with an explicit pretraining objective of preserving the high-frequency details of single frames at arbitrary temporal positions. The baseline model can compress a 20-second video into a context of about 5k in length, from which randomly chosen frames can be retrieved with their appearance perceptually preserved. Such pretrained models can be directly fine-tuned as memory encoders for autoregressive video models, enabling long-history memory at low context cost and with relatively low fidelity loss. We evaluate the framework through ablation studies and discuss the trade-offs among possible neural architecture designs.