Multi-modal hashing is widely used in multimedia retrieval. It fuses multi-source data to generate binary hash codes. However, current multi-modal hashing methods suffer from low retrieval accuracy. The reason is that their individual backbone networks have limited feature expression capability and are not jointly pre-trained on large-scale unsupervised multi-modal data. To address this problem, we propose a new baseline method, CLIP Multi-modal Hashing (CLIPMH). It uses the CLIP model to extract text and image features and then fuses them to generate hash codes. Because CLIP improves the expressiveness of each modality's features, it greatly improves the retrieval performance of multi-modal hashing. Experiments show that CLIPMH significantly outperforms state-of-the-art unsupervised and supervised multi-modal hashing methods (maximum increase of 8.38%). CLIP also shows clear advantages over the text and visual backbone networks commonly used in prior work.
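The pipeline described in the abstract (extract per-modality features, fuse them, generate a binary code) can be illustrated with a minimal sketch. The details below are assumptions made for illustration only and are not taken from the paper: the toy vectors stand in for CLIP image and text embeddings, fusion is plain concatenation of L2-normalized features, and a fixed random projection (`make_projection`, a hypothetical helper) stands in for whatever learned hashing layer CLIPMH actually uses.

```python
import math
import random


def l2_normalize(vec):
    """Scale a vector to unit Euclidean length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def make_projection(in_dim, n_bits, seed=0):
    """Hypothetical stand-in for a learned hashing layer: a fixed random Gaussian matrix."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(in_dim)] for _ in range(n_bits)]


def fuse_and_hash(image_feat, text_feat, projection):
    """Concatenate normalized modality features, project, and binarize by sign."""
    # Assumption: fusion is simple concatenation; the paper may fuse differently.
    fused = l2_normalize(image_feat) + l2_normalize(text_feat)
    return [1 if sum(w * x for w, x in zip(row, fused)) >= 0 else 0
            for row in projection]


def hamming_distance(code_a, code_b):
    """Count the bit positions at which two hash codes differ."""
    return sum(a != b for a, b in zip(code_a, code_b))


# Toy 4-dimensional vectors standing in for CLIP image and text embeddings.
image_feat = [0.9, 0.1, -0.3, 0.5]
text_feat = [0.2, -0.7, 0.4, 0.1]

projection = make_projection(in_dim=8, n_bits=16)
code = fuse_and_hash(image_feat, text_feat, projection)
print(code)
print(hamming_distance(code, code))
```

At retrieval time, database items would be ranked by Hamming distance between their codes and the query code, which is what makes binary hash codes cheap to compare.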