Self-supervised video hashing (SSVH) is a practical task in video indexing and retrieval. Although Transformers are predominant in SSVH for their impressive temporal modeling capabilities, they often suffer from computational and memory inefficiencies. Drawing inspiration from Mamba, an advanced state-space model, we explore its potential in SSVH to achieve a better balance between efficacy and efficiency. We introduce S5VH, a Mamba-based video hashing model with an improved self-supervised learning paradigm. Specifically, we design bidirectional Mamba layers for both the encoder and decoder, which are effective and efficient in capturing temporal relationships thanks to the data-dependent selective scanning mechanism with linear complexity. In our learning strategy, we transform global semantics in the feature space into semantically consistent and discriminative hash centers, followed by a center alignment loss as a global learning signal. Our self-local-global (SLG) paradigm significantly improves learning efficiency, leading to faster and better convergence. Extensive experiments demonstrate S5VH's improvements over state-of-the-art methods, superior transferability, and scalable advantages in inference efficiency. Code is available at https://github.com/gimpong/AAAI25-S5VH.