Video coding has traditionally been developed to support services such as video streaming, videoconferencing, digital TV, and so on. The main intent was to enable human viewing of the encoded content. However, with the advances in deep neural networks (DNNs), encoded video is increasingly being used for automatic video analytics performed by machines. In applications such as automatic traffic monitoring, analytics such as vehicle detection, tracking and counting, would run continuously, while human viewing could be required occasionally to review potential incidents. To support such applications, a new paradigm for video coding is needed that will facilitate efficient representation and compression of video for both machine and human use in a scalable manner. In this manuscript, we introduce the first end-to-end learnable video codec that supports a machine vision task in its base layer, while its enhancement layer supports input reconstruction for human viewing. The proposed system is constructed based on the concept of conditional coding to achieve better compression gains. Comprehensive experimental evaluations conducted on four standard video datasets demonstrate that our framework outperforms both state-of-the-art learned and conventional video codecs in its base layer, while maintaining comparable performance on the human vision task in its enhancement layer. We will provide the implementation of the proposed system at www.github.com upon completion of the review process.
翻译:视频编码传统上用于支持视频流媒体、视频会议、数字电视等服务,其主要目标在于实现编码内容的人类观看。然而,随着深度神经网络(DNNs)的进步,编码视频正越来越多地被机器用于自动视频分析。在自动交通监控等应用中,车辆检测、跟踪和计数等分析任务需持续运行,而人类观看仅偶尔需要以审查潜在事件。为支持此类应用,需建立新的视频编码范式,以可伸缩方式高效实现面向机器和人类使用的视频表示与压缩。本文首次提出支持基础层机器视觉任务的端到端可学习视频编解码器,其增强层可支持人类观看的输入重建。所提系统基于条件编码概念构建,以获得更优的压缩增益。在四个标准视频数据集上的综合实验评估表明,本框架的基础层性能优于当前最先进的学习型与传统视频编解码器,同时增强层在人眼视觉任务中保持可比性能。审稿流程完成后,我们将于www.github.com提供所提系统的实现代码。