Large-scale instance-level training data is scarce, so models are typically trained on domain-specific datasets. Yet in real-world retrieval, they must handle diverse domains, making generalization to unseen data critical. We introduce ELViS, an image-to-image similarity model that generalizes effectively to unseen domains. Unlike conventional approaches, our model operates in similarity space rather than representation space, promoting cross-domain transfer. It leverages local descriptor correspondences, refines their similarities through an optimal transport step with data-dependent gains that suppress uninformative descriptors, and aggregates strong correspondences via a voting process into an image-level similarity. This design injects strong inductive biases, yielding a simple, efficient, and interpretable model. To assess generalization, we compile a benchmark of eight datasets spanning landmarks, artworks, products, and multi-domain collections, and evaluate ELViS as a re-ranking method. Our experiments show that ELViS outperforms competing methods by a large margin in out-of-domain scenarios and on average, while requiring only a fraction of their computational cost. Code available at: https://github.com/pavelsuma/ELViS/
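To make the pipeline described in the abstract concrete, the following is a minimal sketch of the three stages (local descriptor correspondences, an optimal transport refinement, and a voting-style aggregation). It is not the authors' implementation: the function names, the entropic Sinkhorn solver with uniform marginals, the temperature `eps`, and the top-k sum used as "voting" are all assumptions made for illustration. In particular, the data-dependent gains that suppress uninformative descriptors are not modelled here; uniform marginals stand in for them.

```python
import numpy as np


def l2_normalize(x, axis=-1, tiny=1e-12):
    """Scale each descriptor to unit length so dot products are cosine similarities."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + tiny)


def sinkhorn(sim, eps=0.1, n_iters=50):
    """Generic entropic optimal transport with uniform marginals (assumption).

    sim: (n, m) similarity matrix. Returns an (n, m) transport plan.
    """
    n, m = sim.shape
    kernel = np.exp(sim / eps)          # higher similarity -> lower transport cost
    a = np.full(n, 1.0 / n)             # uniform mass on query descriptors
    b = np.full(m, 1.0 / m)             # uniform mass on database descriptors
    u = np.ones(n)
    v = np.ones(m)
    for _ in range(n_iters):
        u = a / (kernel @ v)            # rescale rows toward marginal a
        v = b / (kernel.T @ u)          # rescale columns toward marginal b
    return u[:, None] * kernel * v[None, :]


def image_similarity(desc_q, desc_d, top_k=10):
    """Image-level similarity computed purely from the descriptor similarity matrix.

    desc_q: (n, d) local descriptors of the query image.
    desc_d: (m, d) local descriptors of the database image.
    """
    sim = l2_normalize(desc_q) @ l2_normalize(desc_d).T   # pairwise correspondences
    plan = sinkhorn(sim)                                   # refined, mutually consistent weights
    refined = (plan * sim).ravel()
    k = min(top_k, refined.size)
    # Hypothetical "voting": sum the k strongest refined correspondences.
    return float(np.sort(refined)[-k:].sum())


# Toy usage: a shuffled, slightly perturbed copy of the query descriptors
# versus an unrelated set of random descriptors.
rng = np.random.default_rng(0)
query = rng.normal(size=(20, 64))
matching = query[rng.permutation(20)] + 0.01 * rng.normal(size=(20, 64))
unrelated = rng.normal(size=(20, 64))

score_match = image_similarity(query, matching)
score_unrelated = image_similarity(query, unrelated)
print(score_match, score_unrelated)
```

In a re-ranking setting such as the one evaluated in the abstract, a score of this kind would be computed between the query and each of the top candidates returned by a first-stage retrieval, and the candidates re-sorted by it. Because the score depends only on the similarity matrix and not on the descriptor values themselves, the sketch also illustrates what operating in similarity space rather than representation space means.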