Video conferencing systems suffer from poor user experience when network conditions deteriorate because current video codecs cannot operate at extremely low bitrates. Recently, several neural alternatives have been proposed that reconstruct talking-head videos at very low bitrates using a sparse representation of each frame, such as facial landmarks. However, these approaches produce poor reconstructions when there is major movement or occlusion over the course of a call, and they do not scale to higher resolutions. We design Gemino, a new neural compression system for video conferencing based on a novel high-frequency-conditional super-resolution pipeline. Gemino upsamples a very low-resolution version of each target frame while enhancing high-frequency details (e.g., skin texture and hair) based on information extracted from a single high-resolution reference image. We use a multi-scale architecture that runs different components of the model at different resolutions, allowing it to scale to resolutions comparable to 720p, and we personalize the model to learn specific details of each person, achieving much better fidelity at low bitrates. We implement Gemino atop aiortc, an open-source Python implementation of WebRTC, and show that it operates on 1024×1024 videos in real time on a Titan X GPU and achieves 2.2-5× lower bitrate than traditional video codecs for the same perceptual quality.
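To make the idea of high-frequency-conditional upsampling concrete, the following is a toy, non-neural sketch and not Gemino's model. It assumes grayscale frames stored as 2-D NumPy arrays, an integer scale factor, and a reference image that is already aligned with the target, so it ignores the movement and occlusion that the learned system has to handle. All function names (`downsample`, `upsample`, `high_frequency`, `reconstruct`) are hypothetical and chosen for illustration only.

```python
import numpy as np


def downsample(img: np.ndarray, factor: int) -> np.ndarray:
    """Average-pool by an integer factor (stand-in for the low-res frame that is sent)."""
    h, w = img.shape
    return img.reshape(h // factor, factor, w // factor, factor).mean(axis=(1, 3))


def upsample(img: np.ndarray, factor: int) -> np.ndarray:
    """Nearest-neighbour upsampling by an integer factor."""
    return np.repeat(np.repeat(img, factor, axis=0), factor, axis=1)


def high_frequency(img: np.ndarray, factor: int) -> np.ndarray:
    """Detail lost in a down/up round trip: the image minus its low-pass version."""
    return img - upsample(downsample(img, factor), factor)


def reconstruct(low_res_target: np.ndarray, reference: np.ndarray, factor: int) -> np.ndarray:
    """Upsample the low-res target and add back detail taken from the reference.

    Assumption: the reference is pixel-aligned with the target. A learned model
    would instead predict how to transfer the detail under motion.
    """
    return upsample(low_res_target, factor) + high_frequency(reference, factor)


rng = np.random.default_rng(0)
target = rng.random((64, 64))      # full-resolution frame at the sender
low_res = downsample(target, 8)    # 8x8 frame: 64x fewer pixels to transmit
recon = reconstruct(low_res, target, 8)
```

In this idealized case, where the reference equals the target, the reconstruction recovers the frame up to floating-point error. The sketch only illustrates the split between a low-resolution signal sent per frame and high-frequency detail supplied by a single reference; in practice the reference differs from the target in pose and expression, which is where a learned model is needed.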