Text-to-image retrieval is a fundamental task in multimedia processing, aiming to retrieve semantically relevant cross-modal content. Traditional studies have typically approached this task as a discriminative problem, matching text and images either via a cross-attention mechanism (one-tower framework) or in a common embedding space (two-tower framework). Recently, generative cross-modal retrieval has emerged as a new research line: it assigns each image a unique string identifier and casts retrieval as generating the identifier of the target image. Despite its great potential, existing generative approaches are limited by the following issues: insufficient visual information in identifiers, misalignment with high-level semantics, and a learning gap relative to the retrieval objective. To address these issues, we propose an autoregressive voken generation method, named AVG. AVG tokenizes images into vokens, i.e., visual tokens, and innovatively formulates the text-to-image retrieval task as a token-to-voken generation problem. AVG discretizes an image into a sequence of vokens that serves as the image's identifier, while maintaining alignment with both the visual information and the high-level semantics of the image. Additionally, to bridge the learning gap between generative training and the retrieval objective, we incorporate discriminative training to adjust the learning direction during token-to-voken training. Extensive experiments demonstrate that AVG achieves superior results in both effectiveness and efficiency.
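The core idea of discretizing an image into a voken sequence can be illustrated with a minimal vector-quantization sketch: each image patch embedding is mapped to the index of its nearest codebook entry, and the resulting index sequence serves as the image's identifier. This is only a hedged illustration of the general technique; the function names, codebook, and dimensions below are hypothetical and are not taken from the AVG paper, whose actual tokenizer is not specified in the abstract.

```python
import numpy as np

def tokenize_to_vokens(patch_embs, codebook):
    """Map each patch embedding to the index of its nearest codebook vector.

    patch_embs: (num_patches, d) array of continuous patch embeddings.
    codebook:   (K, d) array of learned discrete code vectors.
    Returns a (num_patches,) array of voken ids, i.e. the image identifier.
    """
    # Squared Euclidean distance between every patch and every code vector.
    d2 = ((patch_embs[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    # Nearest-neighbor lookup gives the discrete voken sequence.
    return d2.argmin(axis=1)

rng = np.random.default_rng(0)
codebook = rng.normal(size=(8, 4))          # hypothetical 8-entry codebook
# Simulate patch embeddings near codebook entries 2, 5, 5, 1 plus small noise.
patches = codebook[[2, 5, 5, 1]] + 0.01 * rng.normal(size=(4, 4))
vokens = tokenize_to_vokens(patches, codebook)
print(vokens.tolist())
```

At retrieval time, a text encoder would autoregressively generate such a voken sequence token by token, and the image whose identifier matches the generated sequence is returned; the discriminative loss mentioned in the abstract would be added on top of this generative objective.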