A Multimodal In-Context Tuning Approach for E-Commerce Product Description Generation

In this paper, we propose a new setting for generating product descriptions from images, augmented by marketing keywords. The setting combines visual and textual information to produce descriptions tailored to the distinctive features of each product. Previous methods applicable to this setting use visual and textual encoders to encode the image and keywords, and a language-model-based decoder to generate the product description. However, the generated descriptions are often inaccurate and generic, for two reasons: products in the same category have similar copywriting, and optimizing the overall framework on large-scale samples leads models to concentrate on common words while ignoring product features. To alleviate this issue, we present a simple and effective Multimodal In-Context Tuning approach, named ModICT, which introduces a similar product sample as a reference and uses the in-context learning capability of language models to produce the description. During training, we keep the visual encoder and language model frozen and optimize only the modules responsible for creating multimodal in-context references and dynamic prompts. This preserves the language generation ability of large language models (LLMs) and substantially increases description diversity. To assess the effectiveness of ModICT across language models of various scales and types, we collect data from three distinct product categories in the e-commerce domain. Extensive experiments demonstrate that ModICT significantly improves the accuracy (by up to 3.3% on Rouge-L) and diversity (by up to 9.4% on D-5) of generated results compared to conventional methods. Our findings underscore the potential of ModICT as a valuable tool for improving the automatic generation of product descriptions in a wide range of applications. Code is available at: https://github.com/HITsz-TMG/Multimodal-In-Context-Tuning
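The idea of using a similar product as an in-context reference can be illustrated with a minimal, text-only sketch. It is not the ModICT implementation: the `Product` class, the `retrieve_reference` and `build_prompt` helpers, the toy feature vectors, and the prompt layout are all hypothetical, and the sketch omits the trainable modules that produce multimodal references and dynamic prompts. It only assumes that each product has a feature vector (for example, from a frozen visual encoder) and that the most similar product by cosine similarity serves as the reference.

```python
import math
from dataclasses import dataclass
from typing import List


@dataclass
class Product:
    # Hypothetical record: features stand in for frozen visual-encoder output.
    features: List[float]
    keywords: str
    description: str


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve_reference(query_features: List[float], pool: List[Product]) -> Product:
    """Pick the pool product whose features are most similar to the query."""
    return max(pool, key=lambda p: cosine(query_features, p.features))


def build_prompt(reference: Product, query_keywords: str) -> str:
    """Place the reference (keywords, description) pair before the query.

    The layout is an assumption made for illustration only.
    """
    return (
        f"Keywords: {reference.keywords}\n"
        f"Description: {reference.description}\n\n"
        f"Keywords: {query_keywords}\n"
        f"Description:"
    )


# Toy data, invented for this example.
pool = [
    Product([1.0, 0.0], "cotton, summer dress", "A light cotton dress for warm days."),
    Product([0.0, 1.0], "leather, ankle boots", "Sturdy leather boots with a low heel."),
]
reference = retrieve_reference([0.9, 0.1], pool)
prompt = build_prompt(reference, "linen, beach dress")
print(prompt)
```

In this sketch the language model would receive `prompt` unchanged and continue after the final `Description:`, so the reference pair steers the output toward product-specific wording without any update to the model's weights.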
