Hierarchical Vector-Quantized Latents for Perceptual Low-Resolution Video Compression

The exponential growth of video traffic places growing demands on bandwidth and storage infrastructure, particularly for content delivery networks (CDNs) and edge devices. Traditional video codecs such as H.264 and HEVC achieve high compression ratios, but they are designed primarily for pixel-domain reconstruction and lack native support for machine-learning-centric latent representations, which limits their integration into deep learning pipelines. In this work, we present a Multi-Scale Vector Quantized Variational Autoencoder (MS-VQ-VAE) that generates compact, high-fidelity latent representations of low-resolution video, suitable for efficient storage, transmission, and client-side decoding. Our architecture extends the VQ-VAE-2 framework to a spatiotemporal setting, introducing a two-level hierarchical latent structure built with 3D residual convolutions. The model is lightweight (approximately 18.5M parameters) and optimized for 64×64 resolution video clips, making it appropriate for deployment on edge devices with constrained compute and memory resources. To improve perceptual reconstruction quality, we incorporate a perceptual loss derived from a pre-trained VGG16 network. We train the model on the UCF101 dataset using 2-second video clips (32 frames at 16 FPS) and achieve 25.96 dB PSNR and 0.8375 SSIM on the test set. On the validation set, our model improves over the single-scale baseline by 1.41 dB PSNR and 0.0248 SSIM. The proposed framework is well-suited for scalable video compression in bandwidth-sensitive scenarios, including real-time streaming, mobile video analytics, and CDN-level storage optimization.
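To make the two core quantities in the abstract concrete, the following is a minimal sketch of nearest-codebook vector quantization (the discretization step shared by VQ-VAE-style models) and of PSNR as a reconstruction metric. It is an illustration under stated assumptions, not the authors' implementation: the function names `quantize`, `dequantize`, and `psnr`, the toy two-dimensional codebook, and the peak signal value of 1.0 are all hypothetical, and the sketch omits the 3D residual convolutions, the two-level hierarchy, and the VGG16 perceptual loss.

```python
import math


def quantize(vectors, codebook):
    """Return, for each latent vector, the index of its nearest codebook entry.

    Nearest is measured by squared Euclidean distance. The codebook here is a
    toy example; a trained model would learn its entries.
    """
    indices = []
    for v in vectors:
        dists = [sum((a - b) ** 2 for a, b in zip(v, c)) for c in codebook]
        indices.append(dists.index(min(dists)))
    return indices


def dequantize(indices, codebook):
    """Replace each index with the codebook vector it refers to."""
    return [list(codebook[i]) for i in indices]


def psnr(reference, reconstruction, peak=1.0):
    """Peak signal-to-noise ratio in dB between two equal-length sequences.

    Assumes values are scaled to [0, peak]; peak=1.0 is an assumption here.
    """
    mse = sum((r - x) ** 2 for r, x in zip(reference, reconstruction)) / len(reference)
    if mse == 0:
        return math.inf
    return 10.0 * math.log10(peak ** 2 / mse)


# Toy data: three codebook entries and three encoder outputs (hypothetical).
codebook = [(0.0, 0.0), (1.0, 1.0), (0.0, 1.0)]
latents = [(0.1, -0.1), (0.9, 1.2), (0.2, 0.8)]

indices = quantize(latents, codebook)      # compact discrete codes
restored = dequantize(indices, codebook)   # vectors a decoder would consume

# PSNR on a tiny flattened "frame" with values in [0, 1].
score = psnr([0.0, 0.5, 1.0], [0.1, 0.5, 0.9])
```

Only the integer indices need to be stored or transmitted, which is where the compactness of a vector-quantized latent comes from; a hierarchical model would repeat this lookup at each of its latent levels with a separate codebook.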
