Video streaming is a fundamental Internet service, yet its quality still cannot be guaranteed under poor network conditions, such as in bandwidth-constrained and remote areas. Existing work falls into two main directions: traditional pixel-codec streaming is approaching its compression limit and offers little room for further gains, while emerging neural-enhanced or generative streaming usually falls short in latency and visual fidelity, hindering practical deployment. Inspired by the recent success of vision foundation models (VFMs), we strive to harness the powerful video understanding and processing capacities of VFMs to achieve generalization, high fidelity, and loss resilience for real-time video streaming at an even higher compression rate. Toward this goal, we present Morphe, the first paradigm that enables VFM-based end-to-end generative video streaming. Specifically, Morphe employs joint training of visual tokenizers and variable-resolution spatiotemporal optimization under simulated network constraints. On top of this, we build a robust streaming system that leverages intelligent packet dropping to resist real-world network perturbations. Extensive evaluation demonstrates that Morphe achieves comparable visual quality while saving 62.5\% bandwidth compared to H.265, and accomplishes real-time, loss-resilient video delivery in challenging network environments, representing a milestone in VFM-enabled multimedia streaming solutions.