Multimodal representation is crucial for e-commerce tasks such as identical product retrieval. Large representation models (e.g., VLM2Vec) demonstrate strong multimodal understanding, yet they struggle with the fine-grained semantic comprehension needed to distinguish highly similar items. To address this, we propose Attribute-Enhanced Fine-Grained Multi-Modal Representation Learning (AFMRL), which formulates fine-grained product understanding as an attribute generation task. AFMRL leverages the generative power of Multimodal Large Language Models (MLLMs) to extract key attributes from product images and text, and enhances representation learning through a two-stage training framework: 1) Attribute-Guided Contrastive Learning (AGCL), where the key attributes generated by the MLLM are used during image-text contrastive training to identify hard samples and filter out noisy false negatives; and 2) Retrieval-aware Attribute Reinforcement (RAR), where the retrieval improvement that the representation model gains from integrating the attributes serves as a reward signal for improving the MLLM's attribute generation during multimodal fine-tuning. Extensive experiments on large-scale e-commerce datasets demonstrate that our method achieves state-of-the-art performance on multiple downstream retrieval tasks, validating the effectiveness of harnessing generative models to advance fine-grained representation learning.
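To make the AGCL idea concrete, the following is a minimal sketch of how generated attributes could steer an image-to-text InfoNCE loss. It is an illustration under stated assumptions, not the AFMRL implementation: the Jaccard overlap measure, the thresholds `fn_threshold` and `hard_threshold`, the `hard_weight` up-weighting, and all function names are hypothetical choices made here. In-batch negatives whose attribute sets almost match the anchor's are treated as likely false negatives and dropped; negatives with partial overlap are treated as hard samples and up-weighted.

```python
import math


def jaccard(a, b):
    """Jaccard overlap between two attribute collections (assumed similarity measure)."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def dot(u, v):
    """Dot product; embeddings are assumed to be L2-normalized already."""
    return sum(x * y for x, y in zip(u, v))


def attribute_guided_info_nce(img_emb, txt_emb, attrs, temperature=0.07,
                              fn_threshold=0.8, hard_threshold=0.4,
                              hard_weight=2.0):
    """Image-to-text InfoNCE with attribute-based negative handling.

    img_emb[i] and txt_emb[i] form the positive pair for item i, and
    attrs[i] holds the attributes generated for item i. All thresholds
    and weights are hypothetical values chosen for illustration.
    """
    n = len(img_emb)
    total = 0.0
    for i in range(n):
        pos = math.exp(dot(img_emb[i], txt_emb[i]) / temperature)
        denom = pos
        for j in range(n):
            if j == i:
                continue
            overlap = jaccard(attrs[i], attrs[j])
            if overlap >= fn_threshold:
                # Near-identical attributes: likely the same product,
                # so drop it from the negatives.
                continue
            # Partial overlap marks a hard negative and gets a larger weight.
            weight = hard_weight if overlap >= hard_threshold else 1.0
            denom += weight * math.exp(dot(img_emb[i], txt_emb[j]) / temperature)
        total += -math.log(pos / denom)
    return total / n


# Toy batch: items 0 and 1 are the same product listed twice.
img = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
txt = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
attrs = [{"red", "cotton", "xl"}, {"red", "cotton", "xl"}, {"blue", "wool"}]

loss_filtered = attribute_guided_info_nce(img, txt, attrs)
# Thresholds above 1.0 can never be reached, which disables both mechanisms.
loss_plain = attribute_guided_info_nce(img, txt, attrs,
                                       fn_threshold=1.1, hard_threshold=1.1)
```

In this toy batch the duplicate listing would otherwise be penalized as a negative even though it is a correct match, so `loss_filtered` comes out lower than `loss_plain`. A practical version would operate on batched tensors and would likely use a learned or embedding-based attribute similarity instead of exact set overlap.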