Deep image hashing aims to map input images to compact binary hash codes with deep neural networks, thereby enabling efficient large-scale image retrieval. Recently, hybrid networks that combine convolutions and Transformers have achieved superior performance on various computer vision tasks and have attracted extensive attention from researchers. Nevertheless, the potential benefits of such hybrid networks for image retrieval have yet to be verified. To this end, we propose a hybrid convolutional and self-attention deep hashing method called HybridHash. Specifically, we propose a backbone network with a stage-wise architecture in which a block aggregation function is introduced to achieve the effect of local self-attention while reducing computational complexity. An interaction module is carefully designed to promote the exchange of information between image blocks and to enhance the visual representations. We conducted comprehensive experiments on three widely used datasets: CIFAR-10, NUS-WIDE and IMAGENET. The experimental results demonstrate that the proposed method outperforms state-of-the-art deep hashing methods. Source code is available at https://github.com/shuaichaochao/HybridHash.
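To make the retrieval side of deep hashing concrete, the following is a minimal sketch, assuming the network emits a K-dimensional real-valued vector per image that is binarized with sign() into a {-1, +1} code, and that database items are ranked by Hamming distance to the query code. It does not reproduce HybridHash's backbone, block aggregation function, interaction module, or training loss. The function names and feature values are hypothetical and chosen only for illustration.

```python
from typing import List, Sequence, Tuple


def binarize(features: Sequence[float]) -> List[int]:
    """Map a real-valued network output to a {-1, +1} hash code via sign().

    Convention (an assumption): a value of exactly 0 is mapped to +1.
    """
    return [1 if x >= 0 else -1 for x in features]


def hamming_distance(a: Sequence[int], b: Sequence[int]) -> int:
    """Hamming distance between two {-1, +1} codes of length K.

    Uses the identity d_H(a, b) = (K - <a, b>) / 2, which is why
    inner products are often used as a surrogate during training.
    """
    if len(a) != len(b):
        raise ValueError("codes must have the same length")
    inner = sum(x * y for x, y in zip(a, b))
    return (len(a) - inner) // 2


def rank_database(
    query: Sequence[int], database: Sequence[Sequence[int]]
) -> List[Tuple[int, int]]:
    """Return (index, distance) pairs sorted by ascending Hamming distance.

    Ties are broken by database index so the ordering is deterministic.
    """
    scored = [(i, hamming_distance(query, code)) for i, code in enumerate(database)]
    return sorted(scored, key=lambda pair: (pair[1], pair[0]))


# Hypothetical 4-bit network outputs for one query and four database images.
query = binarize([0.8, -0.3, 0.5, -0.1])
database = [
    binarize(f)
    for f in [
        [0.7, -0.6, 0.2, -0.9],
        [-0.4, -0.2, 0.6, 0.3],
        [0.1, 0.5, -0.8, -0.2],
        [-0.9, 0.4, -0.3, 0.6],
    ]
]
print(rank_database(query, database))  # → [(0, 0), (1, 2), (2, 2), (3, 4)]
```

Because each code is only K bits, comparing a query against millions of stored codes reduces to cheap bit operations, which is what makes hashing attractive for large-scale retrieval; the quality of the ranking then depends entirely on how well the backbone places similar images near each other in Hamming space.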