This paper investigates scene recognition in video sequences with a small set of repeated shooting locations (as in television series) using artificial neural networks. The basic idea of the presented approach is to select a set of frames from each scene, transform them with a pre-trained single-image pre-processing convolutional network, and classify the scene location with the subsequent layers of the neural network. The considered networks were tested and compared on a dataset obtained from The Big Bang Theory television series. We investigated different neural network layers for combining the individual frames, in particular AveragePooling, MaxPooling, Product, Flatten, LSTM, and Bidirectional LSTM layers. We observed that only some of these approaches are suitable for the task at hand.
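To make the frame-combination step concrete, the following is a minimal sketch of the four stateless ways of merging per-frame feature vectors named in the abstract (average pooling, max pooling, element-wise product, and flattening). It is an illustration under stated assumptions, not the authors' implementation: each frame is assumed to have already been turned into a fixed-length feature vector by the pre-trained convolutional network, the names `combine` and `COMBINERS` are hypothetical, and the LSTM and Bidirectional LSTM variants are omitted because they require trained recurrent weights.

```python
from math import prod
from typing import Callable, Dict, List

Vector = List[float]


def average_pool(frames: List[Vector]) -> Vector:
    """Element-wise mean over frames; output has the per-frame length."""
    return [sum(column) / len(column) for column in zip(*frames)]


def max_pool(frames: List[Vector]) -> Vector:
    """Element-wise maximum over frames; output has the per-frame length."""
    return [max(column) for column in zip(*frames)]


def product(frames: List[Vector]) -> Vector:
    """Element-wise product over frames; output has the per-frame length."""
    return [prod(column) for column in zip(*frames)]


def flatten(frames: List[Vector]) -> Vector:
    """Concatenate all frame vectors; output length grows with frame count."""
    return [value for frame in frames for value in frame]


# Hypothetical registry mapping a method name to its combining function.
COMBINERS: Dict[str, Callable[[List[Vector]], Vector]] = {
    "average": average_pool,
    "max": max_pool,
    "product": product,
    "flatten": flatten,
}


def combine(frames: List[Vector], method: str) -> Vector:
    """Merge per-frame feature vectors of one scene into a single vector."""
    if not frames:
        raise ValueError("at least one frame is required")
    if len({len(frame) for frame in frames}) != 1:
        raise ValueError("all frame vectors must have the same length")
    if method not in COMBINERS:
        raise ValueError(f"unknown method: {method}")
    return COMBINERS[method](frames)


# Toy stand-in for CNN features of two frames (assumed values).
scene = [[1.0, 2.0], [3.0, 4.0]]
```

The combined vector would then be passed to the classification layers that predict the scene location. Note that the three pooling-style combiners are independent of frame order, whereas flattening (like the recurrent layers) preserves it and fixes the number of frames per scene.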