Implicit Neural Representations (INRs) have recently demonstrated impressive performance for video compression. However, since a separate INR must be overfit to each video, scaling to high-resolution videos while maintaining encoding efficiency remains a significant challenge. Hypernetwork-based approaches predict INR weights (hyponetworks) for unseen videos at high speed, but suffer from low reconstruction quality, large compressed sizes, and prohibitive memory requirements at higher resolutions. We address these fundamental limitations through three key contributions: (1) an approach that decomposes the weight prediction task spatially and temporally, breaking short video segments into patch tubelets, to reduce pretraining memory overhead by 20$\times$; (2) a residual-based storage scheme that captures only the differences between consecutive segment representations, significantly reducing bitstream size; and (3) a temporal coherence regularization framework that encourages changes in weight space to be correlated with changes in video content. Our proposed method, TeCoNeRV, achieves substantial improvements of 2.47 dB and 5.35 dB PSNR over the baseline at 480p and 720p on UVG, with 36% lower bitrates and 1.5-3$\times$ faster encoding. Thanks to our low memory usage, we are the first hypernetwork-based approach to demonstrate results at 480p, 720p, and 1080p on UVG, HEVC, and MCL-JCV. Our project page is available at https://namithap10.github.io/teconerv/ .
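The residual-based storage idea in contribution (2) can be sketched minimally: instead of storing each segment's predicted weights in full, store the first segment's weights plus the difference to each subsequent segment. This is only an illustrative toy (the function names and the use of raw NumPy arrays are assumptions; the actual method additionally quantizes and entropy-codes these residuals):

```python
import numpy as np

def encode_residuals(segment_weights):
    """Return the first segment's weights plus per-segment residuals.

    Illustrative sketch only: real codecs would quantize and
    entropy-code the residuals; here we just form the differences.
    """
    base = segment_weights[0]
    residuals = [w - w_prev
                 for w_prev, w in zip(segment_weights, segment_weights[1:])]
    return base, residuals

def decode_residuals(base, residuals):
    """Rebuild each segment's weights by accumulating residuals."""
    weights = [base]
    for r in residuals:
        weights.append(weights[-1] + r)
    return weights

# Toy example: three "segments" of hyponetwork weights.
segments = [np.array([1.0, 2.0]),
            np.array([1.1, 2.0]),
            np.array([1.1, 2.2])]
base, res = encode_residuals(segments)
decoded = decode_residuals(base, res)
assert all(np.allclose(a, b) for a, b in zip(segments, decoded))
```

When consecutive segments change little, the residuals are small and compress far better than the full weight tensors, which is the intuition behind the reported bitstream savings.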