Neural Representations for Videos (NeRV) has emerged as a promising implicit neural representation (INR) approach for video analysis, representing each video as a neural network that takes frame indices as inputs. However, NeRV-based methods are time-consuming when adapting to a large number of diverse videos, because each video requires a separate NeRV model to be trained from scratch. In addition, NeRV-based methods must spatially generate a high-dimensional signal (i.e., an entire image) from a low-dimensional timestamp input, while temporally a video typically consists of tens of frames with only minor changes between adjacent ones. To improve the efficiency of video representation, we propose Meta Neural Representations for Videos (MetaNeRV), a novel framework for fast NeRV representation of unseen videos. MetaNeRV leverages a meta-learning framework to learn an optimal parameter initialization, which serves as a good starting point for adapting to new videos. To address the unique spatial and temporal characteristics of the video modality, we further introduce spatial-temporal guidance to improve the representation capability of MetaNeRV. Specifically, the spatial guidance, implemented as a multi-resolution loss, captures information from different resolution stages, and the temporal guidance adopts an effective progressive learning strategy that gradually refines the number of fitted frames during the meta-learning process. Extensive experiments on multiple datasets demonstrate the superiority of MetaNeRV for video representation and video compression.
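To make the described training scheme concrete, the following minimal PyTorch sketch illustrates one plausible realization under stated assumptions: a Reptile-style outer loop stands in for the paper's meta-learned initialization, and the `TinyNeRV` model, `multires_loss`, `inner_adapt`, the progressive frame schedule, and the toy data are all hypothetical names and settings introduced here for illustration, not the authors' implementation.

```python
# Hypothetical sketch of MetaNeRV-style training: a shared NeRV initialization is
# meta-learned across videos (Reptile-style outer update assumed), with spatial
# guidance (multi-resolution loss) and temporal guidance (progressive frame count).
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F

H, W = 32, 64  # assumed toy frame resolution


class TinyNeRV(nn.Module):
    """Toy NeRV-like network: scalar frame index -> full RGB frame."""

    def __init__(self, hidden=256):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(1, hidden), nn.GELU(),
            nn.Linear(hidden, hidden), nn.GELU(),
            nn.Linear(hidden, 3 * H * W),
        )

    def forward(self, t):                  # t: (B, 1) normalized frame indices
        out = self.mlp(t)                  # (B, 3*H*W)
        return out.view(-1, 3, H, W)       # (B, 3, H, W)


def multires_loss(pred, target, scales=(1.0, 0.5, 0.25)):
    """Spatial guidance: sum of MSE losses computed at several resolutions."""
    loss = 0.0
    for s in scales:
        if s == 1.0:
            p, g = pred, target
        else:
            p = F.interpolate(pred, scale_factor=s, mode="bilinear", align_corners=False)
            g = F.interpolate(target, scale_factor=s, mode="bilinear", align_corners=False)
        loss = loss + F.mse_loss(p, g)
    return loss


def inner_adapt(init_model, video, n_frames, steps=4, lr=1e-3):
    """Adapt a copy of the shared initialization to one video (first n_frames frames)."""
    model = copy.deepcopy(init_model)
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    T = video.shape[0]
    t = torch.linspace(0, 1, T).unsqueeze(1)[:n_frames]  # normalized timestamps
    frames = video[:n_frames]
    for _ in range(steps):
        opt.zero_grad()
        loss = multires_loss(model(t), frames)
        loss.backward()
        opt.step()
    return model


def meta_train(videos, meta_iters=100, meta_lr=0.1):
    """Outer loop learning a shared NeRV initialization (Reptile-style update assumed)."""
    init_model = TinyNeRV()
    T = videos[0].shape[0]
    for it in range(meta_iters):
        # Temporal guidance: progressively grow the number of fitted frames per video.
        n_frames = max(1, int(T * min(1.0, (it + 1) / (0.5 * meta_iters))))
        video = videos[it % len(videos)]
        adapted = inner_adapt(init_model, video, n_frames)
        # Move the initialization toward the adapted weights.
        with torch.no_grad():
            for p_init, p_adapt in zip(init_model.parameters(), adapted.parameters()):
                p_init += meta_lr * (p_adapt - p_init)
    return init_model


if __name__ == "__main__":
    toy_videos = [torch.rand(8, 3, H, W) for _ in range(4)]  # four random 8-frame videos
    meta_train(toy_videos, meta_iters=20)
```

At test time, the meta-learned initialization would be adapted to an unseen video with a few inner steps (as in `inner_adapt`), which is the source of the claimed speed-up over training each NeRV model from scratch.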