Space-time self-similarity (STSS), which captures visual correspondences across frames, provides an effective way to represent temporal dynamics for video understanding. In this work, we explore higher-order STSS and demonstrate how STSSs at different orders reveal distinct aspects of these dynamics. We then introduce the Multi-Order Self-Similarity (MOSS) module, a lightweight neural module designed to learn and integrate multi-order STSS features. It can be applied to diverse video tasks to enhance motion modeling while adding only marginal computational cost and memory overhead. Extensive experiments on video action recognition, motion-centric video VQA, and real-world robotic tasks show consistent and substantial improvements, validating the broad applicability of MOSS as a general temporal modeling module. The source code and checkpoints will be made publicly available.
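To make the notion of STSS concrete, the following is a minimal plain-Python sketch of a first-order self-similarity computation. It rests on assumptions that the abstract does not state: similarity is taken to be cosine similarity, each position is compared only with neighbors inside a small space-time window, and the names `cosine`, `stss`, `t_radius`, and `s_radius` are hypothetical. It does not reproduce the MOSS module or its higher-order variants, which are not defined in the text above.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a > 0 and norm_b > 0 else 0.0


def stss(features, t_radius=1, s_radius=1):
    """Illustrative first-order STSS (assumed formulation, not the paper's code).

    features[t][y][x] is a feature vector (list of floats).
    Returns a dict mapping each position (t, y, x) to a dict
    {(dt, dy, dx): similarity} over its local space-time window.
    Neighbors that fall outside the volume are skipped.
    """
    T, H, W = len(features), len(features[0]), len(features[0][0])
    out = {}
    for t in range(T):
        for y in range(H):
            for x in range(W):
                sims = {}
                for dt in range(-t_radius, t_radius + 1):
                    for dy in range(-s_radius, s_radius + 1):
                        for dx in range(-s_radius, s_radius + 1):
                            tt, yy, xx = t + dt, y + dy, x + dx
                            if 0 <= tt < T and 0 <= yy < H and 0 <= xx < W:
                                sims[(dt, dy, dx)] = cosine(
                                    features[t][y][x], features[tt][yy][xx]
                                )
                out[(t, y, x)] = sims
    return out


# Toy video: 2 frames, 1 row, 3 columns. An "object" feature [1, 0]
# moves one pixel to the right between frames; background is [0, 1].
video = [
    [[[1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]],  # frame 0
    [[[0.0, 1.0], [1.0, 0.0], [0.0, 1.0]]],  # frame 1
]
sims = stss(video)
# Similarity of the object at (t=0, x=0) to the next frame, one pixel right:
moved = sims[(0, 0, 0)][(1, 0, 1)]
# Similarity to the same location in the next frame (now background):
stayed = sims[(0, 0, 0)][(1, 0, 0)]
```

In this toy example the similarity peaks at the offset that follows the object, which is the sense in which an STSS volume encodes frame-to-frame correspondences. A practical implementation would operate on batched tensors in a deep learning framework rather than nested lists.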