Generative cross-modal retrieval, which treats retrieval as a generation task, has emerged as a promising direction with the rise of Multimodal Large Language Models (MLLMs). In this setting, the model responds to a text query by generating an identifier corresponding to the target image. However, existing methods typically rely on manually crafted string IDs, clustering-based labels, or atomic identifiers that require vocabulary expansion, all of which face challenges in semantic alignment or scalability. To address these limitations, we propose a vocabulary-efficient identifier generation framework that prompts MLLMs to generate Structured Semantic Identifiers from image-caption pairs. These identifiers are composed of concept-level tokens, such as objects and actions, and thus align naturally with the model's generation space without modifying the tokenizer. Additionally, we introduce a Rationale-Guided Supervision Strategy that prompts the model to produce a one-sentence explanation alongside each identifier; this explanation serves as an auxiliary supervision signal that improves semantic grounding and reduces hallucinations during training.
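As a rough illustration of the identifier-generation step, the sketch below shows how an image-caption pair might be turned into a concept-level identifier plus a one-sentence rationale via prompting. The prompt wording, the query_mllm helper, and the <identifier>/<rationale> output tags are hypothetical placeholders for exposition, not the actual prompts or parsing code used in this work.

    import re

    # Hypothetical stand-in for an MLLM call; the real framework would
    # pass the image itself alongside the caption.
    def query_mllm(prompt: str) -> str:
        # Canned response for demonstration only.
        return ("<identifier>dog | frisbee | jumping | park</identifier>\n"
                "<rationale>A dog is jumping to catch a frisbee in a park."
                "</rationale>")

    def build_prompt(caption: str) -> str:
        # Ask for concept-level tokens (objects, actions, scene) drawn from
        # the existing vocabulary, plus a one-sentence explanation that can
        # serve as the auxiliary supervision signal.
        return (
            "Given the image and its caption, produce a Structured Semantic "
            "Identifier: 3-5 concept-level tokens (objects, actions, scene), "
            "separated by ' | ', wrapped in <identifier>...</identifier>. "
            "Then give a one-sentence explanation wrapped in "
            "<rationale>...</rationale>.\n"
            f"Caption: {caption}"
        )

    def parse_response(text: str) -> tuple[list[str], str]:
        ident = re.search(r"<identifier>(.*?)</identifier>", text, re.S)
        rat = re.search(r"<rationale>(.*?)</rationale>", text, re.S)
        tokens = [t.strip() for t in ident.group(1).split("|")] if ident else []
        return tokens, rat.group(1).strip() if rat else ""

    caption = "A dog leaps to catch a frisbee at the park."
    tokens, rationale = parse_response(query_mllm(build_prompt(caption)))
    print(tokens)     # ['dog', 'frisbee', 'jumping', 'park']
    print(rationale)  # one-sentence explanation used as auxiliary supervision

Because every concept token already exists in the model's vocabulary, the generated identifiers require no tokenizer modification or embedding expansion.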