In this work we propose a simple unsupervised approach for next-frame prediction in video. Instead of directly predicting the pixels of a frame given past frames, we predict the transformations needed to generate the next frame in a sequence, given the transformations of the past frames. This leads to sharper results while using a smaller prediction model. To enable a fair comparison between different video frame prediction models, we also propose a new evaluation protocol: we use generated frames as input to a classifier trained on ground-truth sequences. This criterion ensures that the models scoring high are those producing sequences which preserve discriminative features, as opposed to merely penalizing any deviation, plausible or not, from the ground truth. Our proposed approach compares favourably against more sophisticated ones on the UCF-101 dataset, while also being more efficient in terms of the number of parameters and computational cost.
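To make the idea of predicting transformations rather than pixels concrete, the following is a minimal sketch and not the model described here. It assumes the transformation is a single 2x3 affine matrix applied to the whole frame (the abstract does not specify the transformation family), and `predict_next_transform` is a hypothetical stand-in for the learned predictor that simply repeats the most recent transformation. The names `warp_affine` and `shift_right` are likewise illustrative.

```python
import numpy as np


def warp_affine(frame, theta):
    """Warp a 2-D frame with a 2x3 affine matrix using nearest-neighbour sampling.

    theta maps each output pixel (x, y, 1) to the source location it is read from.
    Source locations outside the frame are clamped to the nearest border pixel.
    """
    h, w = frame.shape
    ys, xs = np.mgrid[0:h, 0:w]
    coords = np.stack([xs.ravel(), ys.ravel(), np.ones(h * w)])  # shape (3, h*w)
    src = theta @ coords                                         # shape (2, h*w)
    sx = np.clip(np.rint(src[0]).astype(int), 0, w - 1)
    sy = np.clip(np.rint(src[1]).astype(int), 0, h - 1)
    return frame[sy, sx].reshape(h, w)


def predict_next_transform(past_thetas):
    """Hypothetical placeholder for a learned model: reuse the last transformation."""
    return past_thetas[-1]


# Toy example: a single bright pixel that has been moving one pixel to the right.
frame = np.zeros((5, 5))
frame[2, 1] = 1.0

# Assumed convention: output pixel x reads from source x - 1, so content shifts right.
shift_right = np.array([[1.0, 0.0, -1.0],
                        [0.0, 1.0, 0.0]])

theta = predict_next_transform([shift_right, shift_right])
next_frame = warp_affine(frame, theta)
```

Because the next frame is produced by resampling existing pixels rather than regressing intensities, the output keeps the sharpness of the input frame; in this sketch the predictor only has to output six numbers per transformation instead of a full image.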