Temporal information has long been considered essential for perception. While the role of image information in perceptual tasks has been studied extensively, the role of the temporal dimension remains less well understood: What can we learn about the world from long-term motion information? What properties does long-term motion information have for visual learning? We leverage recent progress in point-track estimation, which offers an excellent opportunity to learn temporal representations, and experiment on a variety of perceptual tasks. We draw three clear lessons: 1) Long-term motion representations contain information for understanding not only actions but also objects, materials, and spatial information, often even better than image representations. 2) Long-term motion representations generalize far better than image representations in low-data settings and in zero-shot tasks. 3) The very low dimensionality of motion information gives motion representations a better trade-off between GFLOPs and accuracy than standard video representations, and combining the two achieves higher performance than video representations alone. We hope these insights will pave the way for the design of future models that leverage the power of long-term motion information for perception.