Benchmarking Multimodal Large Language Models for Missing Modality Completion in Product Catalogues

Missing modalities on e-commerce platforms, such as absent product images or textual descriptions, often arise from annotation errors or incomplete metadata, and they impair both product presentation and downstream applications such as recommendation systems. Motivated by the generative capabilities of recent Multimodal Large Language Models (MLLMs), this work investigates a fundamental yet underexplored question: can MLLMs generate missing modalities for products in e-commerce scenarios? We propose the Missing Modality Product Completion Benchmark (MMPCBench), which consists of two sub-benchmarks: a Content Quality Completion Benchmark and a Recommendation Benchmark. We evaluate six state-of-the-art MLLMs from the Qwen2.5-VL and Gemma-3 model families across nine real-world e-commerce categories, focusing on image-to-text and text-to-image completion tasks. Experimental results show that while MLLMs capture high-level semantics, they struggle with fine-grained word-level and pixel- or patch-level alignment. Performance also varies substantially across product categories and model scales, and we observe no simple correlation between model size and performance, in contrast to the trends commonly reported on mainstream benchmarks. We further explore Group Relative Policy Optimization (GRPO) to better align MLLMs with this task. GRPO improves image-to-text completion but yields no gains for text-to-image completion. Overall, these findings expose the limitations of current MLLMs in real-world cross-modal generation and represent an early step toward more effective missing-modality product completion.
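
The abstract does not specify the reward function or the GRPO configuration used. As a minimal sketch of the group-relative idea only, the snippet below normalizes the rewards of several completions sampled for the same product against their own group mean and standard deviation. The function name, the `eps` constant, the use of the population standard deviation, and the reward values are all illustrative assumptions, not the authors' setup.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against its own sampling group (GRPO-style).

    `rewards` holds one scalar per completion sampled for the same prompt.
    `eps` is an assumed stabilizer that avoids division by zero when all
    rewards in the group are identical.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumption: population std; implementations vary
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for four candidate descriptions of one product image.
rewards = [0.2, 0.5, 0.9, 0.4]
advantages = group_relative_advantages(rewards)
```

Completions that score above their group mean receive a positive advantage and those below it a negative one, so no separate learned value model is needed to form the baseline.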
