Recent developments in multi-modal large language models have demonstrated their strong ability to solve vision-language tasks. In this paper, we focus on the product understanding task, which plays an essential role in enhancing the online shopping experience. The product understanding task comprises a variety of sub-tasks, which require models to respond to diverse queries based on multi-modal product information. Traditional methods design a distinct model architecture for each sub-task. In contrast, we present PUMGPT, a large vision-language model that aims to unify all product understanding tasks under a single model structure. To bridge the gap between vision and text representations, we propose Layer-wise Adapters (LA), an approach that provides enhanced alignment with fewer visual tokens and enables parameter-efficient fine-tuning. This inherent parameter-efficient fine-tuning ability also allows PUMGPT to be readily adapted to new product understanding tasks and emerging products. We design instruction templates to generate diverse product instruction datasets. In addition, we use open-domain datasets during training to improve the performance of PUMGPT and its generalization ability. In extensive evaluations, PUMGPT demonstrates superior performance across multiple product understanding tasks, including product captioning, category question-answering, attribute extraction, attribute question-answering, and even free-form question-answering about products.