SCALED: Surrogate-gradient for Codec-Aware Learning of Downsampling in ABR Streaming

The rapid growth in video consumption poses significant challenges for modern streaming architectures. Over-the-Top (OTT) video delivery now relies predominantly on Adaptive Bitrate (ABR) streaming, which dynamically adjusts bitrate and resolution to client-side constraints such as display capabilities and network bandwidth. The pipeline typically downsamples the original high-resolution content, encodes and transmits it, and then decodes and upsamples it on the client side. Traditionally, these stages have been optimized in isolation, leading to suboptimal end-to-end rate-distortion (R-D) performance. The advent of deep learning has spurred interest in jointly optimizing the ABR pipeline with learned resampling methods. However, training such systems end-to-end remains challenging because standard video codecs are non-differentiable, which obstructs gradient-based optimization. Recent works address this issue by approximating codec behavior with differentiable proxy models, based either on deep neural networks or on hybrid coding schemes with differentiable components such as soft quantization. While differentiable proxy codecs have enabled progress in compression-aware learning, they remain approximations that may not fully capture the behavior of standard, non-differentiable codecs. To our knowledge, no prior work has shown that training directly with standard codecs is inefficient. In this work, we introduce a novel framework that enables end-to-end training with real, non-differentiable codecs by leveraging data-driven surrogate gradients derived from actual compression errors. This aligns the training objective with deployment performance. Experimental results show a 5.19% improvement in BD-BR (PSNR) over codec-agnostic training approaches, consistently across the entire rate-distortion convex hull spanning multiple downsampling ratios.

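To make the idea of training through a non-differentiable codec concrete, the following is a minimal toy sketch, not the formulation used in this work (the abstract does not specify it). It assumes a straight-through-style surrogate: the forward pass runs the actual codec, the measured compression error `q - y` is held constant, and the backward pass treats the codec as identity plus that error. Every name here (`toy_codec`, `STEP`, `downsample`, `upsample`, `train`) is hypothetical, and a uniform quantizer stands in for a real video codec.

```python
import math

STEP = 0.02  # quantizer step of the toy "codec" (hypothetical value)


def toy_codec(y):
    """Non-differentiable stand-in for encode + decode: uniform quantization."""
    return [round(v / STEP) * STEP for v in y]


def downsample(x, w):
    """Learnable 2-tap downsampler: each pair of samples becomes one sample."""
    return [w[0] * x[2 * i] + w[1] * x[2 * i + 1] for i in range(len(x) // 2)]


def upsample(q):
    """Fixed nearest-neighbour upsampler: repeat each decoded sample twice."""
    out = []
    for v in q:
        out.extend((v, v))
    return out


def loss_and_surrogate_grad(x, w):
    """MSE after the real codec, plus a surrogate gradient w.r.t. w."""
    y = downsample(x, w)
    q = toy_codec(y)  # forward pass uses the actual (non-differentiable) codec
    x_hat = upsample(q)
    n = len(x)
    loss = sum((a - b) ** 2 for a, b in zip(x_hat, x)) / n
    # Surrogate backward pass: write q = y + e with e = q - y measured in the
    # forward pass and held constant, so dq/dy is taken to be 1. The error e
    # still influences the gradient because q appears in dL/dq below.
    grad = [0.0, 0.0]
    for i, qi in enumerate(q):
        dl_dq = 2.0 * ((qi - x[2 * i]) + (qi - x[2 * i + 1])) / n
        grad[0] += dl_dq * x[2 * i]
        grad[1] += dl_dq * x[2 * i + 1]
    return loss, grad


def train(steps=300, lr=0.2):
    """Plain gradient descent on the downsampler weights of a toy 1-D signal."""
    x = [math.sin(0.9 * k) for k in range(128)]
    w = [1.0, 0.0]  # start from plain decimation (keep every other sample)
    initial_loss, _ = loss_and_surrogate_grad(x, w)
    for _ in range(steps):
        _, g = loss_and_surrogate_grad(x, w)
        w = [w[0] - lr * g[0], w[1] - lr * g[1]]
    final_loss, _ = loss_and_surrogate_grad(x, w)
    return w, initial_loss, final_loss


if __name__ == "__main__":
    weights, before, after = train()
    print("weights:", weights)
    print("loss before:", before, "loss after:", after)
```

In this toy setting the loss is always measured after the real quantizer, so the quantity being optimized is the one that would be observed at deployment. In a deep learning framework, the same pattern would typically be written as a custom backward rule around the call to the external codec.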