Generative retrieval, which has demonstrated effectiveness in text-to-text retrieval, uses a sequence-to-sequence model to directly generate candidate identifiers from natural language queries. Without explicitly computing the similarity between queries and candidates, generative retrieval surpasses dual-tower models in both speed and accuracy on large-scale corpora, providing new insights for cross-modal retrieval. However, constructing identifiers for multimodal data remains an open problem, and, in the absence of additional encoders, the modality gap between natural language queries and multimodal candidates hinders retrieval performance. To this end, we propose a pioneering generAtive Cross-modal rEtrieval framework (ACE), a comprehensive end-to-end framework for cross-modal retrieval based on coarse-to-fine semantic modeling. We combine K-Means and RQ-VAE to construct coarse and fine tokens that serve as identifiers for multimodal data. Correspondingly, we design a coarse-to-fine feature fusion strategy to efficiently align natural language queries with candidate identifiers. ACE is the first work to comprehensively demonstrate the feasibility of the generative approach for text-to-image/audio/video retrieval, challenging the dominance of the embedding-based dual-tower architecture. Extensive experiments show that ACE achieves state-of-the-art performance in cross-modal retrieval and outperforms strong baselines on Recall@1 by 15.27% on average.
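To make the coarse-to-fine identifier idea concrete, below is a minimal sketch, not the paper's implementation: it assumes candidate items (images/audio/videos) have already been encoded into fixed-size embeddings, and it uses plain K-Means for the coarse token plus a simple k-means residual quantizer as an untrained stand-in for an RQ-VAE codebook. The function name `build_identifiers` and all parameter choices are hypothetical.

```python
# Illustrative sketch of coarse-to-fine identifier construction.
# Assumption: `embeddings` holds precomputed candidate embeddings; the
# residual quantizer below is a k-means stand-in for a trained RQ-VAE.
import numpy as np
from sklearn.cluster import KMeans

def build_identifiers(embeddings, n_coarse=256, n_fine=256, fine_levels=2, seed=0):
    """Map each embedding to an identifier (coarse_token, fine_token_1, ..., fine_token_L)."""
    # Coarse token: K-Means cluster id over the raw embeddings.
    coarse_km = KMeans(n_clusters=n_coarse, random_state=seed, n_init=10).fit(embeddings)
    coarse_tokens = coarse_km.labels_

    # Fine tokens: quantize the residual left by the coarse centroid,
    # one codebook per level (RQ-VAE-style residual quantization).
    residual = embeddings - coarse_km.cluster_centers_[coarse_tokens]
    fine_tokens = []
    for level in range(fine_levels):
        fine_km = KMeans(n_clusters=n_fine, random_state=seed + 1 + level, n_init=10).fit(residual)
        codes = fine_km.labels_
        fine_tokens.append(codes)
        residual = residual - fine_km.cluster_centers_[codes]

    # One identifier row per candidate: shape (N, 1 + fine_levels).
    return np.stack([coarse_tokens, *fine_tokens], axis=1)

# Example: 1,000 candidates with 512-d embeddings -> one identifier row each.
rng = np.random.default_rng(0)
ids = build_identifiers(rng.normal(size=(1000, 512)), n_coarse=32, n_fine=32)
print(ids.shape, ids[0])  # (1000, 3): [coarse, fine_1, fine_2]
```

In the framework described above, such identifier sequences would be the target tokens that the sequence-to-sequence model learns to generate from a natural language query.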