Compression artifacts from standard video codecs often degrade perceptual quality. We propose SCENE, a lightweight, semantic-aware pre-processing framework that enhances perceptual fidelity by selectively addressing these distortions. Our method integrates semantic embeddings from a vision-language model into an efficient convolutional architecture, prioritizing the preservation of perceptually significant structures. The model is trained end-to-end with a differentiable codec proxy, enabling it to mitigate artifacts from various standard codecs without modifying the existing video pipeline. At inference time, the codec proxy is discarded and SCENE operates as a standalone pre-processor, enabling real-time performance. Experiments on high-resolution benchmarks show consistent gains over baselines in both objective (MS-SSIM) and perceptual (VMAF) metrics, with notable improvements in preserving fine textures within salient regions. These results show that semantic-guided, codec-aware pre-processing is an effective approach to enhancing compressed video streams.
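The training-versus-inference split described above can be illustrated with a toy sketch. This is not the paper's architecture: the quantizer standing in for the codec proxy, the unsharp-mask pre-processor, and the binary saliency mask are all simplifying assumptions, chosen only to show where each component sits in the pipeline.

```python
import numpy as np

rng = np.random.default_rng(0)

def codec_proxy(frame, step=16.0):
    # Toy stand-in for the differentiable codec proxy: uniform quantization.
    # (The paper's proxy is a learned differentiable approximation of a
    # real codec; this round() is only a placeholder, and is not actually
    # differentiable.)
    return np.round(frame / step) * step

def preprocess(frame, saliency, strength=0.5):
    # Hypothetical pre-processor: boost high-frequency detail only in
    # semantically salient regions so perceptually important texture
    # survives compression. A 4-neighbor average supplies the low-pass.
    blurred = 0.25 * (np.roll(frame, 1, 0) + np.roll(frame, -1, 0)
                      + np.roll(frame, 1, 1) + np.roll(frame, -1, 1))
    detail = frame - blurred
    return frame + strength * saliency * detail

frame = rng.uniform(0, 255, size=(8, 8))
saliency = np.zeros((8, 8))
saliency[2:6, 2:6] = 1.0  # toy semantic mask (would come from the VLM)

# Training-time pipeline: pre-processor -> codec proxy -> distortion loss
# against the original frame, so gradients can shape the pre-processor.
train_out = codec_proxy(preprocess(frame, saliency))
loss = np.mean((train_out - frame) ** 2)

# Inference-time pipeline: the proxy is dropped; only the lightweight
# pre-processor runs before the real, unmodified codec.
infer_out = preprocess(frame, saliency)
```

Note that outside the salient mask the pre-processor is the identity, which mirrors the selective, semantic-guided nature of the method: non-salient pixels pass through untouched.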