Neural fields, also known as coordinate-based or implicit neural representations, have shown a remarkable capability for representing, generating, and manipulating various forms of signals. For video representation, however, mapping pixel-wise coordinates to RGB colors has yielded relatively low compression performance and slow convergence and inference. Frame-wise video representation, which maps a temporal coordinate to an entire frame, has recently emerged as an alternative, improving compression rates and encoding speed. While promising, it still falls short of state-of-the-art video compression algorithms. In this work, we propose FFNeRV, a novel method that incorporates flow information into frame-wise representations to exploit the temporal redundancy across video frames, inspired by standard video codecs. Furthermore, we introduce a fully convolutional architecture, enabled by one-dimensional temporal grids, that improves the continuity of spatial features. Experimental results show that FFNeRV yields the best performance for video compression and frame interpolation among methods using frame-wise representations or neural fields. To reduce the model size even further, we devise a more compact convolutional architecture using group and pointwise convolutions. With model compression techniques, including quantization-aware training and entropy coding, FFNeRV outperforms widely used standard video codecs (H.264 and HEVC) and performs on par with state-of-the-art video compression algorithms.
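To make the model-size argument concrete, the following is a minimal sketch of how replacing a dense convolution with a group convolution followed by a pointwise (1x1) convolution reduces the parameter count. The abstract does not specify the exact layer arrangement, channel widths, or group count used in FFNeRV, so the block layout, the function names `conv_params` and `compact_block_params`, and the numbers below are illustrative assumptions only.

```python
def conv_params(c_in, c_out, kernel, groups=1, bias=True):
    """Learnable parameters in a 2D convolution with a square kernel."""
    if c_in % groups or c_out % groups:
        raise ValueError("channel counts must be divisible by groups")
    # Each output channel only sees c_in / groups input channels.
    weights = (c_in // groups) * c_out * kernel * kernel
    return weights + (c_out if bias else 0)


def compact_block_params(c_in, c_out, kernel, groups, bias=True):
    """Assumed compact block: group conv (spatial mixing within groups)
    followed by a 1x1 pointwise conv (mixing across all channels)."""
    grouped = conv_params(c_in, c_in, kernel, groups=groups, bias=bias)
    pointwise = conv_params(c_in, c_out, 1, bias=bias)
    return grouped + pointwise


if __name__ == "__main__":
    # Hypothetical layer sizes, chosen only for illustration.
    c_in, c_out, kernel, groups = 96, 96, 3, 8
    dense = conv_params(c_in, c_out, kernel)
    compact = compact_block_params(c_in, c_out, kernel, groups)
    print(f"dense 3x3 conv:         {dense} parameters")
    print(f"group + pointwise conv: {compact} parameters")
    print(f"reduction factor:       {dense / compact:.2f}x")
```

Under these assumed sizes the compact block needs roughly a quarter of the parameters of the dense layer. The pointwise convolution is what restores mixing across channel groups, which a group convolution on its own would not provide.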