Leveraging the overfitting property of deep neural networks (DNNs) is trending in video delivery systems to enhance quality within bandwidth limits. Existing approaches transmit overfitted super-resolution (SR) model streams alongside low-resolution (LR) bitstreams, which are used to reconstruct high-resolution (HR) videos at the decoder. Although these approaches show promising results, the huge computational cost of training on a large number of video frames limits their practical application. To overcome this challenge, we propose an efficient patch sampling method named EPS for video SR network overfitting, which identifies the most valuable training patches from video frames. To this end, we first present two low-complexity Discrete Cosine Transform (DCT)-based spatial-temporal features to measure the complexity score of each patch directly. By analyzing the histogram distribution of these features, we then categorize all possible patches into different clusters and select training patches from the cluster with the highest spatial-temporal information. The number of sampled patches adapts to the video content, addressing the trade-off between training complexity and efficiency. Our method reduces the number of training patches to between 4% and 25% of the total, depending on the resolution and number of clusters, while maintaining high video quality and significantly enhancing training efficiency. Compared to the state-of-the-art patch sampling method, EMT, our approach achieves an 83% decrease in overall run time.
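The pipeline described above can be sketched in code. The snippet below is a minimal illustration, not the paper's implementation: the exact DCT-based features, clustering rule, and function names (`spatial_score`, `temporal_score`, `select_patches`) are assumptions. It scores each patch by its non-DC DCT energy (spatial complexity) plus the DCT energy of its frame difference (temporal complexity), buckets the scores into histogram bins as stand-in clusters, and keeps only patches falling in the top bin.

```python
import numpy as np

def dct2(patch):
    # Orthonormal 2-D DCT-II applied via a transform matrix.
    n = patch.shape[0]
    k = np.arange(n)
    C = np.cos(np.pi * (2 * k[None, :] + 1) * k[:, None] / (2 * n))
    C[0] *= 1 / np.sqrt(2)
    C *= np.sqrt(2 / n)
    return C @ patch @ C.T

def spatial_score(patch):
    # High-frequency DCT energy as a proxy for spatial complexity.
    coeffs = dct2(patch.astype(np.float64))
    coeffs[0, 0] = 0.0  # drop the DC term (mean brightness)
    return float(np.sum(np.abs(coeffs)))

def temporal_score(patch, prev_patch):
    # DCT energy of the frame difference as a proxy for motion.
    return spatial_score(patch.astype(np.float64) - prev_patch.astype(np.float64))

def select_patches(frame, prev_frame, patch=32, n_bins=4):
    """Return (y, x) origins of patches in the highest-score histogram bin."""
    h, w = frame.shape
    positions, scores = [], []
    for y in range(0, h - patch + 1, patch):
        for x in range(0, w - patch + 1, patch):
            cur = frame[y:y + patch, x:x + patch]
            prv = prev_frame[y:y + patch, x:x + patch]
            positions.append((y, x))
            scores.append(spatial_score(cur) + temporal_score(cur, prv))
    scores = np.asarray(scores)
    # Bucket patch scores into n_bins histogram bins ("clusters") and
    # keep only the top bin; the number of kept patches thus adapts to
    # how the content's complexity is distributed.
    edges = np.histogram_bin_edges(scores, bins=n_bins)
    keep = scores >= edges[-2]
    return [pos for pos, k in zip(positions, keep) if k]
```

Because patch membership in the top bin depends on the score distribution of the specific clip, textured or fast-moving videos yield more training patches than flat, static ones, which mirrors the content-adaptive sampling budget described above.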