Recent Multimodal Large Language Models (MLLMs) have significantly advanced e-commerce product understanding. However, they still face three challenges: (i) modality imbalance induced by modality-mixed training; (ii) underutilization of the intrinsic alignment between visual and textual information within a product; and (iii) limited handling of noise in e-commerce multimodal data. To address these challenges, we propose MOON2.0, a dynamic modality-balanced MultimOdal representation learning framework for e-commerce prOduct uNderstanding. It comprises: (1) a Modality-driven Mixture-of-Experts (MoE) module that adaptively processes input samples according to their modality composition, enabling Multimodal Joint Learning to mitigate modality imbalance; (2) a Dual-level Alignment method that better leverages the semantic alignment properties within individual products; and (3) an MLLM-based Image-text Co-augmentation strategy that integrates textual enrichment with visual expansion, coupled with Dynamic Sample Filtering to improve training data quality. We further release MBE2.0, a co-augmented Multimodal representation Benchmark for E-commerce representation learning and evaluation, at https://huggingface.co/datasets/ZHNie/MBE2.0. Experiments show that MOON2.0 delivers state-of-the-art zero-shot performance on MBE2.0 and multiple public datasets. Furthermore, attention-based heatmap visualization provides qualitative evidence of improved multimodal alignment in MOON2.0.
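The idea of processing samples "by their modality composition" can be illustrated with a minimal sketch. This is not the MOON2.0 implementation: the abstract does not specify the gating mechanism, so the sketch assumes hard routing to one expert per composition (image-only, text-only, or image plus text), and every name in it (`modality_key`, `scale_expert`, `EXPERTS`, `route`) is hypothetical. The experts are stand-in scaling functions rather than learned networks.

```python
from typing import Callable, Dict, List, Optional, Sequence

Vector = List[float]


def modality_key(image: Optional[Sequence[float]],
                 text: Optional[Sequence[float]]) -> str:
    """Name the modality composition of one sample (hypothetical helper)."""
    if image is not None and text is not None:
        return "multimodal"
    if image is not None:
        return "image"
    if text is not None:
        return "text"
    raise ValueError("sample has neither image nor text features")


def scale_expert(factor: float) -> Callable[[Vector], Vector]:
    """Build a toy expert that scales its input; a stand-in for a learned network."""
    return lambda features: [factor * value for value in features]


# Assumption: one expert per modality composition, selected by hard routing.
EXPERTS: Dict[str, Callable[[Vector], Vector]] = {
    "image": scale_expert(1.0),
    "text": scale_expert(2.0),
    "multimodal": scale_expert(0.5),
}


def route(image: Optional[Sequence[float]] = None,
          text: Optional[Sequence[float]] = None) -> Vector:
    """Send a sample to the expert that matches its modality composition."""
    key = modality_key(image, text)
    # Concatenate whichever feature vectors are present, image first.
    features = list(image or []) + list(text or [])
    return EXPERTS[key](features)
```

For example, `route(text=[1.0])` is handled by the text expert, while `route(image=[2.0], text=[4.0])` is handled by the multimodal expert. A learned, soft gate conditioned on modality would replace the dictionary lookup in a real model.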