Video conferencing systems suffer from poor user experience when network conditions deteriorate because current video codecs simply cannot operate at extremely low bitrates. Recently, several neural alternatives have been proposed that reconstruct talking-head videos at very low bitrates using sparse representations of each frame, such as facial landmark information. However, these approaches produce poor reconstructions in scenarios with major movement or occlusions over the course of a call, and they do not scale to higher resolutions. We design Gemino, a new neural compression system for video conferencing based on a novel high-frequency-conditional super-resolution pipeline. Gemino upsamples a very low-resolution version of each target frame while enhancing high-frequency details (e.g., skin texture and hair) based on information extracted from a single high-resolution reference image. We use a multi-scale architecture that runs different components of the model at different resolutions, allowing it to scale to resolutions comparable to 720p, and we personalize the model to learn specific details of each person, achieving much better fidelity at low bitrates. We implement Gemino atop aiortc, an open-source Python implementation of WebRTC, and show that it operates on 1024×1024 videos in real time on a Titan X GPU and achieves 2.2–5× lower bitrate than traditional video codecs for the same perceptual quality.
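To make the idea of upsampling a low-resolution target frame while borrowing high-frequency detail from a high-resolution reference more concrete, the following is a toy, non-neural sketch in NumPy. It is not Gemino's model, which is learned and handles motion between reference and target; here a simple box filter stands in for the network, the function names (`downsample`, `upsample`, `high_frequency`, `toy_reconstruct`) are hypothetical, and the target is assumed to be pixel-aligned with the reference, differing only in coarse content.

```python
import numpy as np

def downsample(img, factor):
    """Average-pool a 2D image by an integer factor (box filter)."""
    h, w = img.shape
    return img.reshape(h // factor, factor, w // factor, factor).mean(axis=(1, 3))

def upsample(img, factor):
    """Nearest-neighbour upsample a 2D image by an integer factor."""
    return np.repeat(np.repeat(img, factor, axis=0), factor, axis=1)

def high_frequency(img, factor):
    """Detail lost in a down/up round trip: the image minus its low-pass version."""
    return img - upsample(downsample(img, factor), factor)

def toy_reconstruct(low_res_target, reference, factor):
    """Upsample the low-res target, then add the reference's high-frequency detail.

    Assumption: reference and target are pixel-aligned. A real system must
    first account for motion between the two frames.
    """
    return upsample(low_res_target, factor) + high_frequency(reference, factor)

FACTOR = 8
rng = np.random.default_rng(0)

# Synthetic high-resolution reference frame (stand-in for a real image).
reference = rng.random((64, 64))

# Hypothetical target: same fine detail as the reference, but with a
# different coarse component (e.g. a blockwise brightness change).
coarse_change = 0.5 * upsample(rng.random((8, 8)), FACTOR)
target = reference + coarse_change

# Only the low-resolution target would be sent over the network.
low_res = downsample(target, FACTOR)

plain = upsample(low_res, FACTOR)                      # no reference used
guided = toy_reconstruct(low_res, reference, FACTOR)   # reference detail added

err_plain = float(np.mean((plain - target) ** 2))
err_guided = float(np.mean((guided - target) ** 2))
print(f"MSE without reference detail: {err_plain:.4f}")
print(f"MSE with reference detail:    {err_guided:.2e}")
```

In this contrived setting the guided reconstruction recovers the target almost exactly, because the detail missing from the low-resolution frame is, by construction, identical to the detail in the reference. Real video violates that assumption whenever the speaker moves or is occluded, which is the case a learned, motion-aware model is meant to handle.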