Product image segmentation is vital in e-commerce. Most existing methods extract the product foreground from the visual modality alone, which makes it difficult to distinguish the target product from irrelevant products. Because product titles contain abundant appearance information and provide complementary cues for segmentation, we propose a mutual query network that segments products using both visual and linguistic modalities. First, we design a language query vision module that obtains the response of the language description in image areas, thereby aligning the visual and linguistic representations. Then, a vision query language module uses the correlation between the two modalities to filter the product title, suppressing title content that is irrelevant to the image. To promote research in this field, we also construct a Multi-Modal Product Segmentation dataset (MMPS), which contains 30,000 images and their corresponding titles. The proposed method significantly outperforms state-of-the-art methods on MMPS.
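The two query directions described above can be illustrated with plain scaled dot-product attention. The following is only a minimal sketch under stated assumptions, not the network proposed here: the function names, the toy two-dimensional features, and the choice to average attention weights are all hypothetical, and a real model would use learned projections over deep features.

```python
import math

def softmax(scores):
    # Numerically stable softmax over a list of floats.
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def attention(queries, keys):
    # Scaled dot-product attention weights; one row per query.
    scale = math.sqrt(len(keys[0]))
    return [softmax([dot(q, k) / scale for k in keys]) for q in queries]

def language_query_vision(word_feats, region_feats):
    # Hypothetical stand-in: words query image regions, and the
    # per-word attention is averaged into one response per region.
    attn = attention(word_feats, region_feats)
    return [sum(row[j] for row in attn) / len(attn)
            for j in range(len(region_feats))]

def vision_query_language(region_feats, word_feats):
    # Hypothetical stand-in: regions query title words, and the
    # per-region attention is averaged into one weight per word,
    # so words with little visual support get a lower weight.
    attn = attention(region_feats, word_feats)
    return [sum(row[j] for row in attn) / len(attn)
            for j in range(len(word_feats))]

# Toy features (assumed): two regions show the product, one is background.
regions = [[1.0, 0.0], [1.0, 0.0], [0.0, 0.0]]
# First word describes the product; second carries no visual content.
words = [[3.0, 0.0], [0.0, 0.0]]

response = language_query_vision(words, regions)
weights = vision_query_language(regions, words)
print("region response:", response)
print("word weights:", weights)
```

In this toy setting the product regions receive a higher response than the background region, and the descriptive word receives a higher weight than the uninformative one, which mirrors the alignment and title-filtering roles of the two modules at a very coarse level.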