Video prediction is a complex time-series forecasting task with great potential in many use cases. However, traditional methods prioritize accuracy at the cost of slow prediction speed, owing to complex model structures, redundant information, and excessive GPU memory consumption. These methods also typically predict frames sequentially, which makes acceleration difficult and limits their applicability in real-time scenarios such as danger prediction and early warning. We therefore propose a transformer-based keypoint prediction neural network (TKN). TKN extracts the dynamic content of video frames in an unsupervised manner, reducing redundant feature computation. In addition, TKN uses an acceleration matrix to lower the computational cost of attention and employs a parallel computing structure to speed up prediction. To the best of our knowledge, TKN is the first real-time video prediction solution, achieving a prediction rate of 1,176 fps and significantly reducing computation cost while maintaining performance in other respects. Qualitative and quantitative experiments on multiple datasets demonstrate the superiority of our method and suggest that TKN has great application potential.