Efficient large-scale retrieval requires representations that are both compact and discriminative. Foundation models provide powerful visual and multimodal embeddings, but nearest-neighbor search in these high-dimensional spaces is computationally expensive. Hashing offers an efficient alternative by enabling fast Hamming distance search over binary codes, yet existing approaches often rely on complex pipelines, multi-term objectives, designs specialized for a single learning paradigm, and long training times. We introduce CroVCA (Cross-View Code Alignment), a simple and unified principle for learning binary codes that remain consistent across semantically aligned views. A single binary cross-entropy loss enforces alignment, while coding-rate maximization serves as an anti-collapse regularizer that promotes balanced and diverse codes. To implement this principle, we design HashCoder, a lightweight MLP hashing network whose final batch normalization layer enforces balanced codes. HashCoder can be used as a probing head on frozen embeddings or to adapt encoders efficiently via LoRA fine-tuning. Across benchmarks, CroVCA achieves state-of-the-art results in just 5 training epochs. It performs particularly well at 16 bits, where, for instance, unsupervised hashing on COCO completes in under 2 minutes and supervised hashing on ImageNet100 in about 3 minutes on a single GPU. These results highlight CroVCA's efficiency, adaptability, and broad applicability.
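To make the two mechanisms named above concrete (a binary cross-entropy alignment between views, and Hamming distance search over binary codes), here is a minimal pure-Python sketch. It is not the authors' implementation: the function names `binarize`, `hamming`, `bce_alignment`, and `search` are hypothetical, and the specific choice of using the sign-thresholded bits of one view as BCE targets for the other view is an assumption about how such an alignment could be set up. The coding-rate regularizer and the HashCoder network are omitted.

```python
import math


def binarize(logits):
    """Sign-threshold real-valued outputs and pack the bits into one int.

    The first logit becomes the most significant bit.
    """
    code = 0
    for x in logits:
        code = (code << 1) | (1 if x > 0 else 0)
    return code


def hamming(a, b):
    """Hamming distance between two packed codes: popcount of the XOR."""
    return bin(a ^ b).count("1")


def bce_alignment(logits_a, logits_b):
    """Mean binary cross-entropy between view A's bit probabilities and
    view B's hard bits.

    ASSUMPTION: targets are the sign-thresholded bits of view B; the
    abstract only states that a single BCE loss enforces alignment.
    """
    total = 0.0
    for za, zb in zip(logits_a, logits_b):
        target = 1.0 if zb > 0 else 0.0
        # Numerically stable BCE-with-logits:
        #   max(z, 0) - z * t + log(1 + exp(-|z|))
        total += max(za, 0.0) - za * target + math.log1p(math.exp(-abs(za)))
    return total / len(logits_a)


def search(query_code, database, k=1):
    """Return indices of the k database codes closest in Hamming distance."""
    order = sorted(range(len(database)),
                   key=lambda i: hamming(query_code, database[i]))
    return order[:k]


# Illustrative usage with made-up 4-bit outputs for two views of one item.
view_a = [2.0, -1.5, 0.7, -0.3]
view_b = [1.1, -0.4, 0.9, -2.0]
loss = bce_alignment(view_a, view_b)   # small when the views agree in sign
query = binarize(view_a)
neighbors = search(query, [0b0101, 0b1011, 0b1010], k=2)
```

Packing codes into integers lets the distance reduce to one XOR plus a population count, which is what makes Hamming search cheap compared with floating-point nearest-neighbor search in the original embedding space.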