With the rapid advancement of e-commerce, learning general-purpose product representations rather than task-specific ones has attracted increasing research attention. Although existing discriminative dual-flow architectures have driven progress in product understanding, they inherently struggle to model the many-to-one alignment between the multiple images and texts of a product. We therefore argue that generative Multimodal Large Language Models (MLLMs) hold significant potential for improving product representation learning. Achieving this goal, however, remains non-trivial due to several key challenges: the lack of multimodal and aspect-aware modeling modules in typical LLMs; the pervasive background noise in product images; and the absence of a standard evaluation benchmark. To address these issues, we propose MOON, the first generative MLLM-based model for product representation learning. Our method (1) employs a guided Mixture-of-Experts (MoE) module for targeted modeling of multimodal and aspect-specific product content; (2) detects the core semantic regions of product images to mitigate the distraction and interference caused by background noise; and (3) introduces a specialized negative sampling strategy that increases the difficulty and diversity of negative samples. In addition, we release MBE, a large-scale multimodal benchmark covering diverse product understanding tasks. Experimentally, our model achieves competitive zero-shot performance on both our benchmark and a public dataset, demonstrating strong generalization across downstream tasks including cross-modal retrieval, product classification, and attribute prediction. Case studies and visualizations further illustrate the effectiveness of MOON for product understanding. The MBE benchmark data is available at https://huggingface.co/datasets/Daoze/MM-Bench-E-Commerce.