In this paper, we study the transferability of ImageNet spatial and Kinetics spatio-temporal representations to multi-label Movie Trailer Genre Classification (MTGC). In particular, we present an extensive evaluation of the transferability of ConvNet and Transformer models pretrained on ImageNet and Kinetics to Trailers12k, a new manually curated movie trailer dataset composed of 12,000 videos labeled with 10 different genres and associated metadata. We analyze several factors that can influence transferability, such as frame rate, input video length, and spatio-temporal modeling. To reduce the spatio-temporal structure gap between ImageNet/Kinetics and Trailers12k, we propose the Dual Image and Video Transformer Architecture (DIViTA), which performs shot detection to segment each trailer into highly correlated clips, providing a more cohesive input for pretrained backbones and improving transferability (a 1.83% increase for ImageNet and 3.75% for Kinetics). Our results demonstrate that representations learned on either ImageNet or Kinetics are comparably transferable to Trailers12k. Moreover, the two datasets provide complementary information that can be combined to improve classification performance (a 2.91% gain compared to the top single pretraining). Interestingly, using lightweight ConvNets as pretrained backbones resulted in only a 3.46% drop in classification performance compared with the top Transformer, while requiring only 11.82% of its parameters and 0.81% of its FLOPs.
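To make the idea of shot-based segmentation concrete, the following is a minimal sketch of how a trailer might be split into clips at shot boundaries. It is not the shot detector used in DIViTA, which the abstract does not specify. It assumes each frame has already been reduced to a small feature vector (for example, a normalized color histogram) and declares a boundary wherever the L1 distance between consecutive frames exceeds a threshold. The function names `histogram_distance`, `detect_shot_boundaries`, and `split_into_clips`, as well as the threshold value, are hypothetical and chosen only for illustration.

```python
from typing import List, Sequence


def histogram_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """L1 distance between two per-frame feature vectors of equal length."""
    if len(a) != len(b):
        raise ValueError("feature vectors must have the same length")
    return sum(abs(x - y) for x, y in zip(a, b))


def detect_shot_boundaries(
    frames: Sequence[Sequence[float]], threshold: float
) -> List[int]:
    """Return every index i at which frame i is assumed to start a new shot.

    Assumption: a large change between consecutive frame features signals
    a hard cut. Gradual transitions (fades, dissolves) are not handled.
    """
    boundaries = []
    for i in range(1, len(frames)):
        if histogram_distance(frames[i - 1], frames[i]) > threshold:
            boundaries.append(i)
    return boundaries


def split_into_clips(
    frames: Sequence[Sequence[float]], threshold: float
) -> List[List[int]]:
    """Group frame indices into clips, cutting at the detected boundaries."""
    if not frames:
        return []
    cuts = [0] + detect_shot_boundaries(frames, threshold) + [len(frames)]
    return [list(range(start, end)) for start, end in zip(cuts, cuts[1:])]


if __name__ == "__main__":
    # Toy two-bin "histograms": two similar frames, a hard cut, three more.
    toy_frames = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9], [0.1, 0.9]]
    print(split_into_clips(toy_frames, threshold=0.5))
```

On the toy input, the only large jump lies between the second and third frames, so the frames are grouped into two clips. In a pipeline of the kind the abstract describes, each such clip would then be passed to a pretrained image or video backbone. How the per-clip features are aggregated into genre predictions is not covered by this sketch.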