Extracting Implicit Neural Representations (INRs) from video data poses unique challenges due to the additional temporal dimension. For videos, INRs have predominantly relied on a frame-only parameterization, which sacrifices the spatiotemporal continuity observed in pixel-level (spatial) representations. To mitigate this, we introduce Polynomial Neural Representation for Videos (PNeRV), a parameter-efficient, patch-wise INR for videos that preserves spatiotemporal continuity. PNeRV leverages the modeling capabilities of Polynomial Neural Networks to modulate a continuous spatial (patch) signal with a continuous time (frame) signal. We further propose a custom Hierarchical Patch-wise Spatial Sampling Scheme that ensures spatial continuity while retaining parameter efficiency, together with a carefully designed Positional Embedding methodology that further enhances PNeRV's performance. Extensive experiments demonstrate that PNeRV outperforms baselines both on conventional INR tasks such as compression and on downstream applications that require spatiotemporal continuity in the underlying representation. PNeRV not only addresses the challenges posed by video data in the realm of INRs but also opens new avenues for advanced video processing and analysis.
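The core operation named above, multiplicatively modulating a spatial (patch) signal with a temporal (frame) signal via a polynomial interaction, can be sketched as follows. This is a minimal illustration of a second-degree polynomial-network layer, not the paper's actual architecture; all dimensions and weight names (`W_s`, `W_t`, `W_o`) are hypothetical stand-ins for learned parameters.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: spatial (patch) features, temporal (frame) embedding,
# shared hidden width, and output width.
d_s, d_t, d_h, d_o = 8, 4, 16, 8

# Random stand-ins for learned projection weights.
W_s = rng.normal(size=(d_h, d_s))   # projects the spatial signal
W_t = rng.normal(size=(d_h, d_t))   # projects the temporal signal
W_o = rng.normal(size=(d_o, d_h))   # maps the modulated features to the output

def polynomial_modulation(x_spatial, z_time):
    """Second-degree polynomial interaction: the temporal embedding
    multiplicatively gates the spatial features (Hadamard product),
    plus a first-degree skip term on the spatial branch."""
    s = W_s @ x_spatial
    t = W_t @ z_time
    return W_o @ (s * t + s)  # degree-2 cross term + degree-1 term

x = rng.normal(size=d_s)  # continuous spatial (patch) input
z = rng.normal(size=d_t)  # continuous time (frame) input
y = polynomial_modulation(x, z)
print(y.shape)  # (8,)
```

Because the interaction is multiplicative rather than concatenative, the output remains a continuous function of both the spatial and temporal inputs, which is the property the frame-only parameterization gives up.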