Implicit neural representations (INRs) embed signals of various kinds into neural networks, and they have gained attention in recent years for their versatility across diverse signal types. For video, INRs achieve compression by embedding the video signal into a network and then compressing the network itself. Conventional methods feed the network either an index encoding each frame's time or features extracted from the frame; the latter offers greater expressive capability because the input is specific to each video. However, features extracted from frames often contain redundancy, which runs counter to the goal of video compression. Moreover, because frame-time information is not explicitly provided to the network, learning the relationships between frames is challenging. To address these issues, we reduce feature redundancy by extracting features from the high-frequency components of each frame, and we feed the network feature differences between adjacent frames so that it can learn inter-frame relationships smoothly. We thus propose a video representation method based on the high-frequency components of frames and the differences between adjacent-frame features. Experimental results show that our method outperforms the existing HNeRV method on 90% of the videos.
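The pairing of high-frequency content with adjacent-frame feature differences can be illustrated with a minimal sketch. Note that in the actual method the feature extractor is a learned network; the `high_frequency` function below is a hypothetical stand-in (a simple box-blur residual) used only to show the data flow.

```python
import numpy as np

def high_frequency(frame: np.ndarray) -> np.ndarray:
    """Crude high-pass filter: subtract a 5-point local average.

    `frame` is a 2-D grayscale array. The residual retains edges and
    texture, i.e. the high-frequency content from which features would
    be extracted (here we use the residual itself as the feature).
    """
    blur = sum(np.roll(frame, s, axis=a) for a in (0, 1) for s in (-1, 1))
    blur = (blur + frame) / 5.0   # average of the pixel and its 4 neighbors
    return frame - blur           # high-frequency residual

def adjacent_feature_diffs(features: np.ndarray) -> np.ndarray:
    """Differences between features of consecutive frames (axis 0 = time)."""
    return np.diff(features, axis=0)

# Toy video: 4 frames of 8x8 random intensities.
video = np.random.default_rng(0).random((4, 8, 8))
feats = np.stack([high_frequency(f) for f in video])  # per-frame features
diffs = adjacent_feature_diffs(feats)                 # shape (3, 8, 8)
```

Because a constant (low-frequency) frame produces an all-zero residual, redundant smooth regions contribute nothing to the features, and the inter-frame differences expose temporal structure explicitly rather than leaving the network to infer it.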