Compressing videos into binary codes improves retrieval speed and reduces storage overhead. However, learning accurate hash codes for video retrieval is challenging because video frames exhibit high local redundancy and complex global dependencies, especially when labels are unavailable. Existing self-supervised video hashing methods have designed expressive temporal encoders, but they do not fully exploit the temporal dynamics and spatial appearance of videos because their learning tasks are insufficiently challenging and unreliable. To address these limitations, we first use a contrastive learning task to capture the global spatio-temporal information of videos for hashing. With augmentation strategies designed around spatial and temporal variations to create positive pairs, the learning framework generates hash codes that are invariant to motion, scale, and viewpoint. Furthermore, we incorporate two collaborative learning tasks, frame order verification and scene change regularization, to capture local spatio-temporal details within video frames, thereby enhancing the perception of temporal structure and the modeling of spatio-temporal relationships. Our proposed method, Contrastive Hashing with Global-Local Spatio-temporal Information (CHAIN), outperforms state-of-the-art self-supervised video hashing methods on four video benchmark datasets. Our code will be released.
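To make the global contrastive objective concrete, the following is a minimal NumPy sketch of contrastive learning over relaxed hash codes. It is an illustration under stated assumptions, not the CHAIN implementation: the abstract does not specify the loss, so an InfoNCE-style objective is assumed, along with a `tanh` relaxation during training and sign binarization at retrieval time. The function names, the random linear projection standing in for a video encoder, and the temperature value are all hypothetical.

```python
import numpy as np


def relax_and_binarize(features, projection):
    """Map features to K-dim codes: relaxed values in (-1, 1) and binary {-1, +1}.

    The linear projection is a hypothetical stand-in for a video encoder.
    """
    relaxed = np.tanh(features @ projection)
    binary = np.where(relaxed >= 0, 1, -1).astype(np.int8)
    return relaxed, binary


def info_nce_loss(view_a, view_b, temperature=0.5):
    """InfoNCE-style loss (assumed objective, not confirmed by the source).

    Row i of view_a and row i of view_b are two augmented views of the same
    video (the positive pair); all other rows of view_b act as negatives.
    """
    norm_a = np.maximum(np.linalg.norm(view_a, axis=1, keepdims=True), 1e-12)
    norm_b = np.maximum(np.linalg.norm(view_b, axis=1, keepdims=True), 1e-12)
    logits = (view_a / norm_a) @ (view_b / norm_b).T / temperature
    # Subtract the row-wise maximum for a numerically stable log-softmax.
    logits = logits - logits.max(axis=1, keepdims=True)
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_prob)))


def hamming_distance(code_a, code_b):
    """Number of positions at which two binary codes differ."""
    return int(np.sum(code_a != code_b))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    features = rng.normal(size=(4, 16))    # 4 videos, 16-dim features
    projection = rng.normal(size=(16, 8))  # 8-bit hash codes

    # Small perturbations stand in for spatial/temporal augmentations.
    aug_a = features + 0.01 * rng.normal(size=features.shape)
    aug_b = features + 0.01 * rng.normal(size=features.shape)

    relaxed_a, binary_a = relax_and_binarize(aug_a, projection)
    relaxed_b, binary_b = relax_and_binarize(aug_b, projection)

    print("contrastive loss:", info_nce_loss(relaxed_a, relaxed_b))
    print("Hamming distance, two views of video 0:",
          hamming_distance(binary_a[0], binary_b[0]))
```

In this sketch the loss is computed on the relaxed codes so that it remains differentiable in a real training setup, while retrieval would compare the binarized codes by Hamming distance. The frame order verification and scene change regularization tasks are not covered here.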