Category-Oriented Representation Learning for Image to Multi-Modal Retrieval

The rise of multi-modal search requests from users has highlighted the importance of multi-modal retrieval (i.e., image-to-text or text-to-image retrieval), yet the more complex task of image-to-multi-modal retrieval, which is crucial for many industry applications, remains under-explored. To address this gap and promote further research, we introduce and define Image-to-Multi-Modal Retrieval (IMMR), the task of retrieving rich multi-modal (i.e., image and text) documents given image queries. We focus on representation learning for IMMR and analyze three key challenges: 1) skewed data and noisy labels in real-world industrial data; 2) the information inequality between the image and text modalities of documents when learning representations; 3) effective and efficient training in large-scale industrial contexts. To tackle these challenges, we propose a novel framework named Organizing Categories and Learning by Classification for Retrieval (OCLEAR). It consists of three components: 1) a novel category-oriented data governance scheme coupled with a large-scale classification-based learning paradigm, which handles the skewed and noisy data from a data perspective; 2) a model architecture specially designed for multi-modal learning, in which the information inequality between the image and text modalities of documents is taken into account during modality fusion; 3) a hybrid parallel training approach for large-scale training in industrial scenarios. The proposed framework achieves state-of-the-art performance on public datasets and has been deployed in a real-world industrial e-commerce system, leading to significant business growth. Code will be made publicly available.
