Video Classification With CNNs: Using The Codec As A Spatio-Temporal Activity Sensor

We investigate video classification via a two-stream convolutional neural network (CNN) design that directly ingests information extracted from compressed video bitstreams. Our approach begins with the observation that all modern video codecs divide the input frames into macroblocks (MBs). We demonstrate that selective access to MB motion vector (MV) information within compressed video bitstreams also enables selective, motion-adaptive MB pixel decoding (a.k.a. MB texture decoding). This in turn allows spatio-temporal video activity regions to be derived at far higher speed than with conventional full-frame decoding followed by optical flow estimation. To evaluate the accuracy of a video classification framework based on such activity data, we independently train two CNN architectures on MB texture and MV correspondences and then fuse their scores to derive the final classification of each test video. Evaluation on two standard datasets shows that the proposed approach is competitive with the best two-stream video classification approaches found in the literature. At the same time: (i) a CPU-based realization of our MV extraction is over 977 times faster than GPU-based optical flow methods; (ii) selective decoding is up to 12 times faster than full-frame decoding; (iii) our proposed spatial and temporal CNNs perform inference at 5 to 49 times lower cloud computing cost than the fastest methods from the literature.
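
As a rough illustration of the two ideas summarized above (motion-adaptive selection of MBs and fusion of the two streams' scores), the following minimal Python sketch may help. It is not the implementation evaluated here: the function names, the magnitude threshold used to flag an MB as active, and the weighted-average fusion rule are all assumptions made for illustration, since the abstract specifies neither the selection criterion nor the fusion rule.

```python
import math


def active_macroblocks(mv_field, threshold=1.0):
    """Return (row, col) indices of macroblocks whose MV magnitude exceeds threshold.

    mv_field is a 2-D list holding one (dx, dy) motion vector per macroblock.
    The thresholding rule is a hypothetical selection criterion.
    """
    active = []
    for r, row in enumerate(mv_field):
        for c, (dx, dy) in enumerate(row):
            if math.hypot(dx, dy) > threshold:
                active.append((r, c))
    return active


def fuse_scores(spatial, temporal, temporal_weight=0.5):
    """Fuse per-class scores of the two streams by a weighted average.

    Returns (index of the best class, list of fused scores).
    Weighted averaging is an assumed fusion rule, chosen for simplicity.
    """
    if len(spatial) != len(temporal):
        raise ValueError("both streams must score the same set of classes")
    w = temporal_weight
    fused = [(1.0 - w) * s + w * t for s, t in zip(spatial, temporal)]
    best = max(range(len(fused)), key=fused.__getitem__)
    return best, fused


# Toy 2x2 macroblock grid: only blocks with noticeable motion are selected,
# so only those blocks would need texture (pixel) decoding.
mv_field = [[(0, 0), (3, 4)],
            [(0.5, 0), (0, -2)]]
selected = active_macroblocks(mv_field, threshold=1.0)

# Toy class scores from the spatial (texture) and temporal (MV) streams.
label, fused = fuse_scores([0.7, 0.2, 0.1], [0.2, 0.6, 0.2])
```

In this toy example only the two macroblocks with motion magnitude above the threshold are selected, which is the sense in which MV access can steer selective texture decoding; the fusion step then combines the two independently produced score vectors into a single prediction.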
