Improving Transfer Learning with a Dual Image and Video Transformer for Multi-label Movie Trailer Genre Classification

In this paper, we study the transferability of ImageNet spatial and Kinetics spatio-temporal representations to multi-label Movie Trailer Genre Classification (MTGC). In particular, we present an extensive evaluation of the transferability of ConvNet and Transformer models pretrained on ImageNet and Kinetics to Trailers12k, a new manually curated movie trailer dataset composed of 12,000 videos labeled with 10 different genres and associated metadata. We analyze different aspects that can influence transferability, such as frame rate, input video extension, and spatio-temporal modeling. To reduce the spatio-temporal structure gap between ImageNet/Kinetics and Trailers12k, we propose the Dual Image and Video Transformer Architecture (DIViTA), which performs shot detection to segment the trailer into highly correlated clips, providing a more cohesive input for pretrained backbones and improving transferability (a 1.83% increase for ImageNet and 3.75% for Kinetics). Our results demonstrate that representations learned on either ImageNet or Kinetics are comparably transferable to Trailers12k. Moreover, both datasets provide complementary information that can be combined to improve classification performance (a 2.91% gain compared to the top single pretraining). Interestingly, using lightweight ConvNets as pretrained backbones resulted in only a 3.46% drop in classification performance compared with the top Transformer while requiring only 11.82% of its parameters and 0.81% of its FLOPs.
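
To make the shot-based segmentation step more concrete, the following is a minimal sketch of how a trailer could be cut into clips at shot boundaries. It is an illustration under stated assumptions, not the authors' method: the abstract does not say which shot detector DIViTA uses, so a simple histogram-difference heuristic stands in for it here, and the names `detect_shot_boundaries`, `split_into_clips`, `threshold`, and `bins` are hypothetical.

```python
import numpy as np


def detect_shot_boundaries(frames, threshold=0.5, bins=16):
    """Return the frame indices at which a new shot starts (index 0 is always included).

    Assumption: a shot change is flagged when the L1 distance between the
    normalized intensity histograms of consecutive frames exceeds `threshold`.
    This heuristic is a stand-in, not the detector used in the paper.
    """
    boundaries = [0]
    prev_hist = None
    for i, frame in enumerate(frames):
        hist, _ = np.histogram(frame, bins=bins, range=(0, 256))
        hist = hist / hist.sum()
        # The L1 distance between two normalized histograms lies in [0, 2].
        if prev_hist is not None and np.abs(hist - prev_hist).sum() > threshold:
            boundaries.append(i)
        prev_hist = hist
    return boundaries


def split_into_clips(num_frames, boundaries):
    """Turn shot-start indices into half-open (start, end) frame ranges."""
    ends = boundaries[1:] + [num_frames]
    return list(zip(boundaries, ends))


# Synthetic "trailer": three shots with clearly different brightness levels.
frames = (
    [np.full((8, 8), 10, dtype=np.uint8)] * 5
    + [np.full((8, 8), 200, dtype=np.uint8)] * 4
    + [np.full((8, 8), 120, dtype=np.uint8)] * 3
)
clips = split_into_clips(len(frames), detect_shot_boundaries(frames))
print(clips)
```

Each resulting `(start, end)` range would then be passed to a pretrained image or video backbone as one clip, so that every input covers a single visually coherent shot rather than an arbitrary window spanning several cuts.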
