As video transmission increasingly serves machine vision systems (MVS) rather than human vision systems (HVS), video coding for machines (VCM) has become a critical research topic. Existing VCM methods often bind codecs to specific downstream models, requiring retraining or supervised data and thus limiting generalization in multi-task scenarios. Recently, unified VCM frameworks have employed visual backbones (VB) and visual foundation models (VFM) to support multiple video understanding tasks with a single codec. They mainly use the VB/VFM to maintain semantic consistency or suppress non-semantic information, but seldom explore how to directly link video coding with understanding under VB/VFM guidance. Hence, we propose a Symmetric Entropy-Constrained Video Coding framework for Machines (SEC-VCM). It establishes a symmetric alignment between the video codec and the VB, allowing the codec to leverage the VB's representation capabilities to preserve semantics and discard MVS-irrelevant information. Specifically, a bi-directional entropy-constraint (BiEC) mechanism enforces symmetry between video decoding and VB encoding by suppressing the conditional entropy between them. This helps the codec explicitly handle semantic information beneficial to MVS while squeezing out information that is useless to MVS. Furthermore, a semantic-pixel dual-path fusion (SPDF) module injects pixel-level priors into the final reconstruction. Through semantic-pixel fusion, it suppresses artifacts harmful to MVS and improves machine-oriented reconstruction quality. Experimental results on classical video understanding tasks and MLLM-based tasks show that SEC-VCM achieves state-of-the-art (SOTA) rate-task performance. It achieves significant bitrate savings over the H.266/VVC reference software VTM on video instance segmentation (37.4%), video object segmentation (29.8%), object detection (46.2%), multiple object tracking (44.9%), and MLLM-based video grounding (97.6%).
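To make the idea of symmetry through suppressed conditional entropy concrete, the following is a toy sketch and not the authors' BiEC implementation. It assumes discrete symbols, whereas a real codec would work with learned continuous features and a neural entropy model. The names `conditional_entropy`, `bidirectional_entropy`, `aligned`, and `lossy` are hypothetical, as is the reading of each pair as (VB semantic symbol, codec latent symbol). The point it illustrates is a standard information-theoretic fact: when both H(X|Y) and H(Y|X) are zero, each side fully determines the other, so the mapping between them is one-to-one.

```python
import math
from collections import Counter


def conditional_entropy(pairs, given_index):
    """Empirical H(A | B) in bits for a list of 2-tuples.

    B is the element at position `given_index` in each pair;
    A is the other element.
    """
    n = len(pairs)
    joint = Counter(pairs)
    marginal = Counter(p[given_index] for p in pairs)
    h = 0.0
    for pair, count in joint.items():
        p_joint = count / n                          # p(a, b)
        p_cond = count / marginal[pair[given_index]]  # p(a | b)
        h -= p_joint * math.log2(p_cond)
    return h


def bidirectional_entropy(pairs):
    """Return (H(X|Y), H(Y|X)) for pairs of (x, y)."""
    return conditional_entropy(pairs, 1), conditional_entropy(pairs, 0)


# Hypothetical pairs: (VB semantic symbol, codec latent symbol).
# One-to-one mapping: both conditional entropies vanish.
aligned = [("cat", "c0"), ("dog", "c1"), ("car", "c2"), ("bus", "c3")]

# Many-to-one mapping: the latent no longer identifies the semantic symbol.
lossy = [("cat", "c0"), ("dog", "c0"), ("car", "c1"), ("bus", "c1")]

print(bidirectional_entropy(aligned))  # (0.0, 0.0)
print(bidirectional_entropy(lossy))    # (1.0, 0.0)
```

In the `lossy` case one bit of semantic information cannot be recovered from the latent, so only one direction is penalized. Constraining both directions is what rules out both lost semantics and extra, non-semantic content in this toy setting.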