Video prediction is a complex time-series forecasting task with great potential in many use cases. However, conventional methods overemphasize accuracy and ignore the slow prediction speed caused by complicated model structures, which learn too much redundant information and consume excessive GPU memory. Furthermore, conventional methods mostly predict frames sequentially (frame by frame) and are thus hard to accelerate. Consequently, valuable use cases such as real-time danger prediction and warning cannot reach inference speeds fast enough to be applicable in practice. We therefore propose a transformer-based keypoint prediction neural network (TKN), an unsupervised learning method that speeds up prediction through constrained information extraction and a parallel prediction scheme. To the best of our knowledge, TKN is the first real-time video prediction solution; it significantly reduces computation costs while maintaining performance on other metrics. Extensive experiments on the KTH and Human3.6 datasets demonstrate that TKN predicts 11 times faster than existing methods while reducing memory consumption by 17.4% and achieving state-of-the-art prediction performance on average.