In e-commerce, accurately extracting product attribute values from multimodal data is crucial for improving user experience and the operational efficiency of retailers. However, previous approaches to multimodal attribute value extraction often struggle with implicit attribute values embedded in images or text, rely heavily on extensive labeled data, and easily confuse similar attribute values. To address these issues, we introduce EIVEN, a data- and parameter-efficient generative framework that pioneers the use of multimodal LLMs for implicit attribute value extraction. EIVEN leverages the rich inherent knowledge of a pre-trained LLM and vision encoder to reduce reliance on labeled data. We also introduce a novel Learning-by-Comparison technique that reduces model confusion by requiring the model to compare attribute values and identify their differences. Additionally, we construct initial open-source datasets for multimodal implicit attribute value extraction. Our extensive experiments show that EIVEN significantly outperforms existing methods in extracting implicit attribute values while requiring less labeled data.